Senior Site Reliability Engineer – AI Infrastructure


11 - 50 employees

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

🔥 Funding within the last year

💰 €15,142,238 Series A - Andromeda Robotics in 2025-09

Artificial Intelligence • B2B • Hardware

Andromeda is a GPU compute service and marketplace offering instant access to large clusters of H100, H200, and B200 accelerators for experiments, large-scale training, and inference. It supports orchestration with Slurm, Kubernetes, or direct SSH, and offers flexible usage with no minimum term at competitive prices. The service includes DevOps expertise, local or streaming NAS storage with no ingress/egress fees, and 24/7 support with industry-standard SLAs. The company also operates a third-party GPU marketplace at gpulist.ai.


Job not listed on LinkedIn

🕒 2 months ago

🏄 California – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Distributed Systems

Kubernetes

Linux

Python

PyTorch


Andromeda


Description

• Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
• Serve as the primary technical point of contact for customers running large-scale training workloads
• Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
• Ensure the health and performance of high-speed interconnects
• Build deep visibility into GPU utilization, memory pressure, interconnect throughput
• Build production-grade automation for cluster provisioning, GPU health checks, job scheduling
• Lead incident response for complex failures spanning hardware, networking, orchestration

🎯 Requirements

• Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
• Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
• Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
• Expert-level Linux knowledge
• Strong experience running Kubernetes in production with GPU workloads
• Strong engineering skills in Python, Go, or Bash
• Hands-on experience building monitoring and alerting for GPU infrastructure
• Proven track record leading incident response for complex distributed systems

🏖️ Benefits

• Health insurance
• Retirement plans
• Paid time off
• Flexible work arrangements
• Professional development


Similar Jobs

SRE – Infra

🕒 2 months ago

PostHog

11 - 50

☁️ SaaS

⚡ Productivité

🏢 Enterprise

SRE role focusing on turning fast-growing systems into predictable, reliable platforms. Join PostHog to build and automate infrastructure.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

AWS

Cloud

Kubernetes

Linux

Node.js

Terraform

Senior Infrastructure Engineer/SRE

🕒 2 months ago

Cresta

51 - 200

☁️ SaaS

🤖 Artificial Intelligence

🏢 Enterprise

Senior Infrastructure Engineer/SRE responsible for building core infrastructure at an AI-driven contact center company. Designing tools for developers and ensuring reliability across cloud platforms.

🇺🇸 United States – Remote

💵 $205,000 - $270,000 / year

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

AWS

Azure

Cloud

DNS

EC2

Flux

Kubernetes

Postgres

Python

Terraform

Lead Site Reliability Engineer

🕒 2 months ago

Alteryx

1001 - 5000

🤖 Artificial Intelligence

🤝 B2B

Lead Site Reliability Engineer guiding reliability strategy and execution for a modern multi-region SaaS platform. Focused on system design, incident management, and cross-team collaboration.

🇺🇸 United States – Remote

💵 $136,000 - $177,000 / year

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Cloud

Distributed Systems

Grafana

Java

JavaScript

Kubernetes

Python

Staff Software Engineer, Tech Lead – Mobile DevOps

🕒 2 months ago

Toast

1001 - 5000

☁️ SaaS

🤝 B2B

Staff Software Engineer, Tech Lead focused on mobile DevOps at Toast, specializing in Android development and CI/CD processes for restaurant technology.

🇺🇸 United States – Remote

💵 $193,000 - $309,000 / year

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Android

Cloud

Gradle

Java

Jenkins

Kotlin

React

DevOps Architect / SME, MultiCloud

🕒 2 months ago

EITACIES Inc.

51 - 200

🏢 Enterprise

🔒 Cybersecurity

🤖 Artificial Intelligence