Senior Site Reliability Engineer

October 1

The Voleon Group

Finance • Artificial Intelligence

The Voleon Group develops and applies advanced machine learning technologies for investment management, using statistical algorithms and data-driven techniques to improve financial prediction and management practices. Established in 2007, the company is headquartered near the University of California, Berkeley, and benefits from the surrounding academic environment. Its team draws on talent in statistics, computer science, and related fields and works in a collaborative culture. The leadership has a background in computer science and statistics and emphasizes scalability and risk management in the firm's investment strategies.

51 - 200 employees

Founded 2007

📋 Description

• Help scale the research compute cluster to meet growing needs.
• Apply engineering skills to ensure high uptime, reliability, and robustness.
• Keep research clusters available and performant.
• Provide a world-class HPC platform for researchers working on machine learning problems at scale.
• Support both on-prem and cloud infrastructure, ensuring the best experience for technical staff.
• Collaborate with engineering teams to develop monitoring and telemetry improvements.
• Design and oversee operational frameworks to ensure cluster operations meet SLAs.

🎯 Requirements

• 5+ years of experience in SRE or DevOps roles, preferably as a senior engineer or tech lead.
• Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod).
• Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.).
• Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible).
• Experience with cloud infrastructure (AWS or GCP).
• Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
• Experience with distributed storage technologies (Lustre, Ceph, S3).
• A "system engineer" rather than "system administrator" mindset: thinking systematically and leveraging automation.
• Bachelor's degree in computer science or equivalent experience.

🏖️ Benefits

• Medical, dental, and vision coverage
• Life and AD&D insurance
• 20 days of paid time off
• 9 sick days
• 401(k) plan with a company match
• “Friends of Voleon” Candidate Referral Program
