Senior - Principal Site Reliability Engineer

October 29

DataCrunch

DataCrunch.io is a new cloud service provider. Our main focus is providing our own infrastructure for machine learning.

11 - 50 employees

💰 Pre Seed Round on 2021-11

📋 Description

• Ensure the reliability, scalability, and performance of HPC and cloud systems.
• Build and maintain automation, observability, and monitoring frameworks for compute clusters.
• Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.
• Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.
• Participate in architecture design and long-term infrastructure strategy discussions.
• Help establish local infrastructure and contribute to the setup of our future San Francisco office.
• Play a key role in recruiting and mentoring as our U.S. team grows.

🎯 Requirements

• 7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems.
• Linux expertise (Ubuntu or Debian preferred).
• Strong experience with scripting and automation (Python, Go, Bash).
• Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).
• Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible).
• Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.
• Familiarity with ML model training environments.
• Understanding of Kubernetes (nice to have).

🏖️ Benefits

• Generous cash + equity compensation.
• Various fringe benefits (e.g., healthcare, lunch, wellbeing).

Similar Jobs

October 29

UserTesting

501 - 1000

☁️ SaaS

🏢 Enterprise

🤝 B2B

Senior DevOps Engineer focused on cloud infrastructure for UserTesting, ensuring systems are fast and reliable. Collaborating with engineers to deliver exceptional developer experiences.

🇺🇸 United States – Remote

💰 Grant on 2020-11

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

October 29

Site Reliability Engineer developing and maintaining critical features for Stone Tech. Responsible for monitoring performance and ensuring reliability across systems.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

🗣️🇧🇷🇵🇹 Portuguese Required

October 29

Lumos

51 - 200

🌐 Web 3

📋 Compliance

☁️ SaaS

DevOps Engineer enhancing and maintaining cloud infrastructure at fast-growing startup Lumos. Collaborates with development and operations teams for automation and scalability.

🇺🇸 United States – Remote

💵 $160k - $190k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

October 29

Hydra Host

11 - 50

🔧 Hardware

🏢 Enterprise

🤖 Artificial Intelligence

Site Reliability Engineer ensuring high uptime and performance for cloud systems at Hydra Host. Collaborating with teams to integrate monitoring and QA tools for reliability and observability.

🇺🇸 United States – Remote

💵 $140k - $200k / year

💰 $10M Seed Round on 2022-04

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

October 28

NVIDIA

10,000+ employees

🤖 Artificial Intelligence

🎮 Gaming

Senior Site Reliability Engineer ensuring daily operations and incident handling for large scale GPU platforms at NVIDIA. Contributing to feature design and cluster validation for optimal performance and resilience.

🇺🇸 United States – Remote

💵 $168k - $333.5k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor
