Site Reliability Engineer

Artificial Intelligence • Cloud Computing

FluidStack provides GPU supercomputing infrastructure for AI labs. It offers on-demand access to thousands of Nvidia GPUs for large-scale AI training and inference. The company specializes in deploying and managing large GPU clusters, with support for technologies like Kubernetes and Slurm, high availability, and responsive support. Its fully managed cloud infrastructure lets AI companies focus on developing models without worrying about the underlying hardware. It emphasizes performance and cost-efficiency, offering services that scale to thousands of GPUs with high uptime and rapid response times.

11 - 50 employees

🤖 Artificial Intelligence

July 11

🏄 California – Remote

⏰ Full Time

🟢 Junior

🟡 Mid-level

⛑ DevOps & Site Reliability Engineer (SRE)

🚫👨‍🎓 No degree required

Ansible

Cloud

Flash

Kubernetes

Open Source

Python

Terraform

📋 Description

• SREs at Fluidstack sit at the core of our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud.
• They partner closely with teams including networking, platform engineering, and data center operations to build systems that scale with the demands of AI workloads.
• SREs are hands-on and possess deep systems knowledge and strong communication skills.
• A typical day may involve deploying clusters of 1,000+ GPUs using custom-written playbooks; validating the correctness and performance of the underlying compute, storage, and networking infrastructure; migrating petabytes of data from public cloud platforms to local storage; debugging issues; and building internal tooling to decrease deployment time and increase cluster reliability.
• This role involves being part of an on-call rotation of up to one week per month.

🎯 Requirements

• 2+ years of SRE, DevOps, sysadmin, and/or HPC engineering experience.
• Great verbal and written communication skills in English.
• Experience deploying and operating Kubernetes and/or Slurm clusters.
• Experience writing Go, Python, Bash.
• Experience using Ansible, Terraform, and other automation or IaC tools.
• Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields.

🏖️ Benefits

• Competitive total compensation package (cash + equity).
• Retirement or pension plan, in line with local norms.
• Health, dental, and vision insurance.
• Generous PTO policy, in line with local norms.
• Fluidstack is remote first, but has offices in London, New York, and SF. For all other locations, we provide access to WeWork.
