Senior Site Reliability Engineer – AI Infrastructure

Job not on LinkedIn

🕒 2 months ago

🏄 California – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Andromeda

11 - 50 employees

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

🔥 Funding within the last year

💰 €15,142,238 Series A - Andromeda Robotics in 2025-09

Andromeda is a GPU compute service and marketplace offering instant access to large clusters of H100, H200, and B200 accelerators for experiments, large-scale training, and inference. It supports orchestration with Slurm, Kubernetes, or direct SSH; offers flexible usage with no minimum term and competitive pricing; and includes DevOps expertise, local or streaming NAS storage with no ingress/egress fees, and 24/7 support with industry-standard SLAs. The company also operates a third-party GPU marketplace at gpulist.ai.

Description

• Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
• Serve as the primary technical point of contact for customers running large-scale training workloads
• Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
• Ensure the health and performance of high-speed interconnects
• Build deep visibility into GPU utilization, memory pressure, interconnect throughput
• Build production-grade automation for cluster provisioning, GPU health checks, job scheduling
• Lead incident response for complex failures spanning hardware, networking, orchestration

🎯 Requirements

• Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
• Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
• Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
• Expert-level Linux knowledge
• Strong experience running Kubernetes in production with GPU workloads
• Strong engineering skills in Python, Go, or Bash
• Hands-on experience building monitoring and alerting for GPU infrastructure
• Proven track record leading incident response for complex distributed systems

🏖️ Benefits

• Health insurance
• Retirement plans
• Paid time off
• Flexible work arrangements
• Professional development

Similar Jobs

🕒 2 months ago

PostHog

11 - 50

☁️ SaaS

⚡ Productivity

🏢 Enterprise

SRE role focusing on turning fast-growing systems into predictable, reliable platforms. Join PostHog to build and automate infrastructure.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

🕒 2 months ago

Cresta

51 - 200

☁️ SaaS

🤖 Artificial Intelligence

🏢 Enterprise

Senior Infrastructure Engineer/SRE responsible for building core infrastructure at an AI-driven contact center company. Designing tools for developers and ensuring reliability across cloud platforms.

🗣️🇺🇸🇬🇧 English required

🕒 2 months ago

Alteryx

1001 - 5000

🤖 Artificial Intelligence

🤝 B2B

Lead Site Reliability Engineer guiding reliability strategy and execution for a modern multi-region SaaS platform. Focused on system design, incident management, and cross-team collaboration.

🗣️🇺🇸🇬🇧 English required

🕒 2 months ago

Toast

1001 - 5000

☁️ SaaS

🤝 B2B

Staff Software Engineer, Tech Lead focused on mobile DevOps at Toast, specializing in Android development and CI/CD processes for restaurant technology.

🗣️🇺🇸🇬🇧 English required

🕒 2 months ago

EITACIES Inc.

51 - 200

🏢 Enterprise

🔒 Cybersecurity

🤖 Artificial Intelligence

DevOps Architect leading platform engineering standards across a multi-cloud, hybrid environment at Eitacies Inc. Focus on automation, infrastructure, and cloud architecture.

🇺🇸 United States – Remote

💵 $60 / hour

⏰ Full Time

🟠 Senior

🔴 Expert

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required