Senior Site Reliability Engineer – AI Infrastructure

🕒 April 9

🏄 California – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Andromeda

11 - 50 employees

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

🔥 Funding within the last year

💰 $15.1M Series A - Andromeda Robotics on 2025-09

Andromeda is a GPU compute service and marketplace offering instant access to large clusters of H100, H200, and B200 accelerators for experiments, full-scale training, and inference. It supports orchestration with Slurm, Kubernetes, or direct SSH, provides flexible, no-minimum-duration usage and competitive pricing, and includes DevOps expertise, local NAS or streamed storage with no ingress/egress fees, and 24/7 support with industry SLAs. The company also operates a third-party GPU marketplace at gpulist.ai.

📋 Description

• Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
• Serve as the primary technical point of contact for customers running large-scale training workloads
• Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
• Ensure the health and performance of high-speed interconnects
• Build deep visibility into GPU utilization, memory pressure, interconnect throughput
• Build production-grade automation for cluster provisioning, GPU health checks, job scheduling
• Lead incident response for complex failures spanning hardware, networking, orchestration

🎯 Requirements

• Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
• Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
• Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
• Expert-level Linux knowledge
• Strong experience running Kubernetes in production with GPU workloads
• Strong engineering skills in Python, Go, or Bash
• Hands-on experience building monitoring and alerting for GPU infrastructure
• Proven track record leading incident response for complex distributed systems

🏖️ Benefits

• Health insurance
• Retirement plans
• Paid time off
• Flexible work arrangements
• Professional development
