Senior - Principal Site Reliability Engineer

DataCrunch.io is a new cloud service provider focused on providing its own infrastructure for machine learning.

11 - 50 employees

💰 Pre Seed Round on 2021-11

October 29

🏄 California – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

AWS

Azure

Cloud

Distributed Systems

DNS

Google Cloud Platform

Kubernetes

Linux

Python

Terraform

DataCrunch

📋 Description

• Ensure the reliability, scalability, and performance of HPC and cloud systems.
• Build and maintain automation, observability, and monitoring frameworks for compute clusters.
• Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.
• Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.
• Participate in architecture design and long-term infrastructure strategy discussions.
• Help establish local infrastructure and contribute to the setup of our future San Francisco office.
• Play a key role in recruiting and mentoring as our U.S. team grows.

🎯 Requirements

• 7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems.
• Linux expertise (Ubuntu or Debian preferred).
• Strong experience with scripting and automation (Python, Go, Bash).
• Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).
• Deep understanding of networking (DNS, TCP) and infrastructure-as-code tools (Terraform, Ansible).
• Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.
• Familiarity with ML model training environments.
• Understanding of Kubernetes (nice to have).

🏖️ Benefits

• Generous cash + equity compensation.
• Various fringe benefits (e.g., healthcare, lunch, wellbeing).

Similar Jobs

Senior DevOps Engineer

October 29

UserTesting

501 - 1000

☁️ SaaS

🏢 Enterprise

🤝 B2B

Senior DevOps Engineer focused on cloud infrastructure for UserTesting, ensuring systems are fast and reliable. Collaborating with engineers to deliver exceptional developer experiences.

🇺🇸 United States – Remote

💰 Grant on 2020-11

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

Grafana

Jenkins

Kubernetes

Prometheus

Python

Terraform

Site Reliability Engineer III

October 29

Stone & Company

2 - 10

Site Reliability Engineer developing and maintaining critical features for Stone Tech. Responsible for monitoring performance and ensuring reliability across systems.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

🗣️🇧🇷🇵🇹 Portuguese Required

AWS

GRPC

JavaScript

MongoDB

Postgres

SQL

DevOps Engineer

October 29

Lumos

51 - 200

🌐 Web 3

📋 Compliance

☁️ SaaS

DevOps Engineer enhancing and maintaining cloud infrastructure at fast-growing startup Lumos. Collaborates with development and operations teams for automation and scalability.

🇺🇸 United States – Remote

💵 $160k - $190k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Azure

Cloud

Distributed Systems

Docker

Google Cloud Platform

Grafana

Jenkins

Kubernetes

Microservices

Prometheus

Python

Terraform

Vault

Site Reliability Engineer

October 29

Hydra Host

11 - 50

🔧 Hardware

🏢 Enterprise

🤖 Artificial Intelligence

Site Reliability Engineer ensuring high uptime and performance for cloud systems at Hydra Host. Collaborating with teams to integrate monitoring and QA tools for reliability and observability.

🇺🇸 United States – Remote

💵 $140k - $200k / year

💰 $10M Seed Round on 2022-04

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

Grafana

Kubernetes

Prometheus

Python

Senior Site Reliability Engineer, BCM – DGX Cloud

October 28

NVIDIA

10,000+ employees

🤖 Artificial Intelligence

🎮 Gaming

Senior Site Reliability Engineer ensuring daily operations and incident handling for large scale GPU platforms at NVIDIA. Contributing to feature design and cluster validation for optimal performance and resilience.

🇺🇸 United States – Remote

💵 $168k - $333.5k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Kubernetes

Linux

Python