Staff SRE, AI Infrastructure

Job not on LinkedIn

🕒 May 21

🏄 California – Remote

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Andromeda

11 - 50 employees

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

🔥 Funding within the last year

💰 $15.1M Series A - Andromeda Robotics on 2025-09

Andromeda is a GPU compute service and marketplace offering instant access to large clusters of H100, H200, and B200 accelerators for experiments, full-scale training, and inference. It supports orchestration with Slurm, Kubernetes, or direct SSH, provides flexible, no-minimum-duration usage and competitive pricing, and includes DevOps expertise, local NAS or streamed storage with no ingress/egress fees, and 24/7 support with industry SLAs. The company also operates a third-party GPU marketplace at gpulist.ai.

📋 Description

• Own the reliability of Andromeda's infrastructure end to end
• Lead incident response for top customers' training runs and write the postmortems
• Ensure the health of thousands of GPUs across providers
• Build telemetry, GPU health checks, and automated remediation
• Define on-call processes such as rotations and escalation
• Be the reliability voice in customer incident reviews
• Collaborate closely with the product team on SLOs
• Partner with providers and data center teams on physical design
• Make other engineers better through mentorship

🎯 Requirements

• Multiple years building and operating large-scale GPU infrastructure as your primary job
• A clear history of owning the reliability of load-bearing infrastructure
• Deep, hands-on experience with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale
• Real production experience with InfiniBand, RoCE, and NVLink fabrics
• Working knowledge of how large training jobs run: NCCL, CUDA, PyTorch distributed
• Strong Go, Python, or Rust proficiency
• Expert-level knowledge of Linux and systems internals
• Comfortable being the senior engineer on a P0 bridge with the customer
• Comfortable being the senior technical voice with AI infra customers

🏖️ Benefits

• Significant autonomy
• Working on infrastructure that the most ambitious AI labs depend on

Similar Jobs

🕒 May 20

SouthState Bank

1001 - 5000

🏦 Banking

💸 Finance

💳 Fintech

Payment Platform DevOps Engineer at SouthState enabling secure and scalable delivery of cloud-based payment solutions. Collaborating with internal teams for innovation in payment technology.

🇺🇸 United States – Remote

💵 $152.6k - $243.8k / year

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🕒 May 18

Valiantys - Atlassian Platinum Solution Partner

51 - 200

🏢 Enterprise

☁️ SaaS

🤝 B2B

Director for AI-Enabled DevOps Transformation at Valiantys, focusing on enterprise account growth and strategy alignment. Engage with clients on SDLC modernization and AI-enabled delivery.

🇺🇸 United States – Remote

💵 $175k - $240k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🕒 May 15

Zscaler

5001 - 10000

🔒 Cybersecurity

☁️ SaaS

🏢 Enterprise

Principal DevOps Engineer managing AWS infrastructure for Zscaler’s Zero Trust Networking Services. Architecting cloud infrastructure and ensuring operational health in a remote role.

🇺🇸 United States – Remote

💵 $182k - $260k / year

💰 Secondary Market on 2017-11

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

🕒 May 14

Quantiphi

1001 - 5000

🤖 Artificial Intelligence

🏢 Enterprise

📚 Education

Senior DevOps/Observability Engineer leading unified observability platform design for Fortune 500 clients. Focused on architecting observability pipeline using AWS and modern open-source tools.

🕒 May 13

WEX

5001 - 10000

🚗 Transport

💸 Finance

💳 Fintech

SRE Architect driving AI-Powered Reliability Engineering strategy and enforcing enterprise-wide SRE standards. Overseeing the architecture and implementation of mission-critical systems for WEX.

🇺🇸 United States – Remote

💵 $200.6k - $250.4k / year

💰 $310M Post-IPO Debt on 2020-06

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor
