Genmo

Artificial Intelligence • Media

Genmo develops AI video generation models. Its product, Mochi 1, is an open-source video generation model that sets new standards in motion quality and in physically realistic simulation. Genmo aims to solve fundamental problems in AI video technology and to give users finer control over characters and settings through text prompts. Mochi 1 is designed to produce fluid, human-like actions and expressions. The company is hiring researchers, scientists, and engineers to build state-of-the-art open video models.

2 - 10 employees

🤖 Artificial Intelligence

📱 Media

Senior Site Reliability Engineer - GPU Infrastructure

June 17

🇺🇸 United States – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

AWS

Azure

Cloud

Flux

Google Cloud Platform

Grafana

Kubernetes

Prometheus

Python

Terraform

📋 Description

• Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.
• Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.
• Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
• Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
• Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
• Optimize high‑performance networking (InfiniBand/RDMA) and debug perf bottlenecks.
• Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.
• Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

🎯 Requirements

• 3+ yrs SRE/DevOps in production; 2+ yrs managing large Kubernetes fleets.
• Expert-level Kubernetes experience.
• Proficient in Python and Bash and IaC tools (Terraform, Helm, Ansible).
• Track record of shipping and operating large-scale infrastructure with high reliability and clear communication.
• Multi-cluster / multi-cloud (AWS, GCP, Azure, bare-metal) production experience.
• Hands-on with containerized GPU stacks (nvidia-container-toolkit, GPU Operator).
• GPU schedulers such as Slurm or Kueue.
• Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
• Prior work with distributed training, model-serving patterns, or other ML/GPU workloads.

Similar Jobs

DevOps Engineer

June 11

Nava

201 - 500 employees

🏛️ Government

🤝 B2B

☁️ SaaS

Nava seeks experienced infrastructure engineers for AWS systems management and improvement.

🇺🇸 United States – Remote

💵 $126k - $135.9k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Cloud

Docker

JavaScript

Jenkins

Linux

Packer

Python

Ruby

Terraform

Unix

DevOps Trainer – Technical Trainer

June 8

Enthuziastic

11 - 50 employees

Enthuziastic is seeking DevOps trainers to teach learners worldwide technologies such as Docker, Kubernetes, and CI/CD. Candidates should have extensive technical training experience.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

AWS

Azure

Chef

Cloud

Docker

Gradle