Senior Site Reliability Engineer

October 1

The Voleon Group

Finance • Artificial Intelligence

The Voleon Group develops and applies advanced machine learning technologies to investment management. Using statistical algorithms and data-driven techniques, Voleon aims to improve financial prediction and management practices. The company was established in 2007 and is headquartered near the University of California, Berkeley, where it benefits from a strong academic environment. Voleon's team brings together top talent in statistics, computer science, and related fields, and fosters innovation through a collaborative work culture. Its leaders are highly educated, with backgrounds in computer science and statistics, and emphasize scalability and risk management in their investment strategies.

51 - 200 employees

Founded 2007

📋 Description

• Help scale the research compute cluster to meet growing needs.
• Leverage engineering skills to ensure high degrees of uptime, reliability, and robustness.
• Responsible for keeping research clusters available and performant.
• Provide a world-class HPC platform for researchers focusing on machine learning problems at scale.
• Support both on-prem and cloud infrastructure, ensuring the best experience for technical staff.
• Collaborate with engineering teams to develop monitoring and telemetry improvements.
• Design and oversee operational frameworks to ensure cluster operations meet SLAs.

🎯 Requirements

• 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead.
• Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod).
• Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.).
• Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible).
• Experience with cloud infrastructure (AWS or GCP).
• Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
• Experience with distributed storage technologies (Lustre, Ceph, S3).
• Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation.
• Bachelor's degree in computer science or equivalent experience.

🏖️ Benefits

• Medical, dental, and vision coverage
• Life and AD&D insurance
• 20 days of paid time off
• 9 sick days
• 401(k) plan with a company match
• “Friends of Voleon” Candidate Referral Program
