Senior Site Reliability Engineer - GPU Infrastructure

June 17

Genmo

Artificial Intelligence • Media

Genmo develops cutting-edge AI video generation models. Its product, Mochi 1, is an open-source video generation model that sets new standards in motion quality and in realistic simulation that follows the laws of physics. Genmo aims to solve fundamental problems in AI video technology by giving users superior control over characters and settings through text prompts. Mochi 1 is designed to create fluid, human-like actions and expressions, advancing the capabilities of AI-generated video. The company invites researchers, scientists, and engineers to join its team in building state-of-the-art open video models.

📋 Description

• Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.
• Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.
• Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
• Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
• Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
• Optimize high‑performance networking (InfiniBand/RDMA) and debug perf bottlenecks.
• Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.
• Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

🎯 Requirements

• 3+ yrs SRE/DevOps in production; 2+ yrs managing large Kubernetes fleets.
• Expert-level Kubernetes experience.
• Proficient in Python and Bash and IaC tools (Terraform, Helm, Ansible).
• Track record of shipping and operating large-scale infrastructure with high reliability and clear communication.
• Multi-cluster / multi-cloud (AWS, GCP, Azure, bare-metal) production experience.
• Hands-on with containerized GPU stacks (nvidia-container-toolkit, GPU Operator).
• GPU schedulers such as Slurm or Kueue.
• Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
• Prior work with distributed training, model-serving patterns, or other ML/GPU workloads.
