Andromeda

Website LinkedIn All Job Openings

11 - 50 employees

🏥 Healthcare

💼 Consulting

🏨 Hospitality

🔥 Funding within the last year

💰 $15.1M Series A - Andromeda Robotics on 2025-09

Andromeda is a GPU compute service and marketplace offering instant access to large clusters of H100, H200, and B200 accelerators for experiments, full-scale training, and inference. It supports orchestration with Slurm, Kubernetes, or direct SSH, provides flexible, no-minimum-duration usage and competitive pricing, and includes DevOps expertise, local NAS or streamed storage with no ingress/egress fees, and 24/7 support with industry SLAs. The company also operates a third-party GPU marketplace at gpulist.ai.

Senior Site Reliability Engineer – AI Infrastructure

🕒 April 9

🏄 California – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Distributed Systems

Kubernetes

Linux

Python

PyTorch

📋 Description

• Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
• Serve as the primary technical point of contact for customers running large-scale training workloads
• Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
• Ensure the health and performance of high-speed interconnects
• Build deep visibility into GPU utilization, memory pressure, interconnect throughput
• Build production-grade automation for cluster provisioning, GPU health checks, job scheduling
• Lead incident response for complex failures spanning hardware, networking, orchestration

🎯 Requirements

• Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
• Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
• Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
• Expert-level Linux knowledge
• Strong experience running Kubernetes in production with GPU workloads
• Strong engineering skills in Python, Go, or Bash
• Hands-on experience building monitoring and alerting for GPU infrastructure
• Proven track record leading incident response for complex distributed systems

🏖️ Benefits

• Health insurance
• Retirement plans
• Paid time off
• Flexible work arrangements
• Professional development

Similar Jobs

Senior Infrastructure Engineer/SRE

🕒 April 9

Cresta

51 - 200

☁️ SaaS

🤖 Artificial Intelligence

🏢 Enterprise

Website LinkedIn All Job Openings

Senior Infrastructure Engineer/SRE responsible for building core infrastructure at AI-driven contact center company. Designing tools for developers and ensuring reliability across cloud platforms.

🇺🇸 United States – Remote

💵 $205k - $270k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Azure

Cloud

DNS

EC2

Flux

Kubernetes

Postgres

Python

Terraform

DevOps Architect / SME, MultiCloud

🕒 April 8

EITACIES Inc.

51 - 200

💼 Consulting

🏥 Healthcare

🏭 Manufacturing

Website LinkedIn All Job Openings

DevOps Architect leading platform engineering standards across a multi-cloud, hybrid environment at Eitacies Inc. Focus on automation, infrastructure, and cloud architecture.

🇺🇸 United States – Remote

💵 $60 / hour

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

DNS

Docker

DynamoDB

Firewalls

Google Cloud Platform

Kubernetes

Python

SQL

Terraform

Senior Devops Engineer

🕒 April 3

Avive Solutions Inc.

11 - 50

🏥 Healthcare

💼 Consulting

📦 Logistics

Website LinkedIn All Job Openings

DevOps Engineer for Avive Solutions, building cloud infrastructure to revolutionize cardiac arrest responses. Collaborate cross-functionally to optimize systems for high-impact healthcare technology.

🇺🇸 United States – Remote

💵 $140k - $180k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

Docker

Kubernetes

Linux

Python

Terraform
