
11 - 50 employees
🤖 Artificial Intelligence
Artificial Intelligence • Cloud Computing
FluidStack is a company that provides GPU supercomputing infrastructure for AI labs. It offers on-demand access to thousands of Nvidia GPUs, enabling large-scale AI training and inference. The company specializes in deploying and managing large GPU clusters with support for technologies like Kubernetes and Slurm, ensuring high availability and excellent support. FluidStack provides a fully managed cloud infrastructure, helping AI companies to focus on developing models without worrying about the underlying hardware. They emphasize performance and cost-efficiency, offering services that scale to thousands of GPUs with high uptime and rapid response times.
🔥 0 minutes ago
🇺🇸 United States – Remote
💵 $150k - $250k / year
⏰ Full Time
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)

• Take the on-call escalation when a site hits trouble and triage it virtually, using real knowledge of the team and the systems to decide what to escalate, when, and how to keep the field crew focused without burying them.
• Get on a plane when it matters: travel site to site (50%+) to work live incidents and post-incident reviews on the floor, and bring the practices that worked elsewhere with you.
• Own root cause analysis on significant events through to closure and track corrective actions to done, killing the underlying class of failure rather than the one instance in front of you.
• Read the patterns across the fleet’s incidents and RCAs, push the few highest-value learnings through to closure, and stay honest about what’s achievable and what to drop instead of boiling the ocean.
• Carry learnings and practices from one campus to the next so a fix at one site becomes the standard everywhere before the failure repeats.
• Write the operational assessment standard and audit each campus against it, feeding what you find straight back into the corrective-action loop.
• You’ve run a live critical operation and led a team of operators, and you carry the deep, earned judgment that comes from owning the floor when it counts.
• You’ve been the person a site calls when something breaks, triaged the problem over the phone, and known exactly when to escalate and when to let the field team work it.
• You’ve authored root cause analyses on significant events and tracked corrective actions to closure, and you can show the difference between an RCA that closed a ticket and one that killed a class of failure.
• You’ve sat with a pile of RCA actions and cut it to the few that matter, because you know an operation that commits to everything finishes nothing.
• You’ve traveled site to site, walked the floor, and left each operation better than you found it, carrying the practices that worked from one into the next.
• You’ve written the standard, not just followed it, audited real sites against it without flinching from what you found, and can hold one bar across domains you don’t all live in.
• Bonus: hyperscale or large colocation at hundreds of MW+; direct exposure to Hardware or Network operations, not only Facilities; incident.io or equivalent incident tooling, plus DCIM; building an assessment, audit, qualification, or training program from scratch.
• Competitive total compensation package (salary + equity).
• Retirement or pension plan, in line with local norms.
• Health, dental, and vision insurance.
• Generous PTO policy, in line with local norms.
🕒 Yesterday
Site Reliability Engineer blending software engineering, automation, and operations expertise. Building scalable platforms and enabling high-velocity delivery for critical Defense systems.
🇺🇸 United States – Remote
💵 $164.4k - $215.1k / year
⏰ Full Time
🟠 Senior
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)
🦅 H1B Visa Sponsor
Cloud
Distributed Systems
Grafana
Kubernetes
Linux
Prometheus
Python
Splunk
🕒 3 days ago
Staff Software Engineer responsible for enhancing reliability and security in production environments. Collaborating on projects to scale systems at Coinbase.
🇺🇸 United States – Remote
💵 $218k - $256.5k / year
💰 $21.4M Post-IPO Equity on 2022-11
⏰ Full Time
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)
🦅 H1B Visa Sponsor
AWS
Azure
Cloud
Google Cloud Platform
Ruby
Terraform
Go
🕒 3 days ago
Cloud DevOps Engineer providing effective cloud solutions and system performance management. Involves hands-on development, support, and troubleshooting for large-scale cloud environments.
🇺🇸 United States – Remote
💵 $102k - $138k / year
⏰ Full Time
🟠 Senior
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)
🦅 H1B Visa Sponsor
Ansible
Azure
Cloud
Docker
Firewalls
Kubernetes
Linux
Microservices
Python
Switching
🕒 4 days ago
Senior DevOps/Observability Engineer leading the design of a unified observability platform. Focused on architecting a sophisticated observability pipeline leveraging AWS technologies.
🇺🇸 United States – Remote
💰 Series A on 2019-12
⏰ Full Time
🟠 Senior
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)
🦅 H1B Visa Sponsor
AWS
Grafana
Kubernetes
Prometheus
Splunk
Terraform
🕒 4 days ago
DevSecOps Engineer ensuring secure software development at Redox, enhancing healthcare data exchange. Collaborating with platform engineers to implement security best practices across the AWS/EKS infrastructure.
🇺🇸 United States – Remote
💵 $190k - $199k / year
⏰ Full Time
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)
🦅 H1B Visa Sponsor
AWS
Cloud
JavaScript
Kubernetes
Node.js
Python
Terraform
TypeScript
Go