Site Reliability Engineer – AI Infrastructure

11 - 50 employees

🏥 Healthcare

💼 Consulting

🏨 Hospitality

🔥 Funding within the last year

💰 $15.1M Series A - Andromeda Robotics on 2025-09

Andromeda is a GPU compute service and marketplace offering instant access to large clusters of H100, H200, and B200 accelerators for experiments, full-scale training, and inference. It supports orchestration with Slurm, Kubernetes, or direct SSH; offers flexible usage with no minimum duration and competitive pricing; and includes DevOps expertise, local NAS or streamed storage with no ingress/egress fees, and 24/7 support with industry SLAs. The company also operates a third-party GPU marketplace at gpulist.ai.

🕒 February 27

🏄 California – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Ansible

Grafana

Kubernetes

Linux

Prometheus

Python

Terraform

📋 Description

• Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
• Build automation and tooling to streamline cluster deployments and integrations
• Debug customer issues across networking, storage, scheduling, and system layers
• Improve reliability and scalability of both training and inference infrastructure
• Design and implement monitoring, alerting, and observability for critical systems
• Collaborate with engineering and product teams to plan and deliver infrastructure for new services
• Participate in on-call and incident response, leading postmortems and reliability improvements

🎯 Requirements

• 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
• Strong Linux systems and networking fundamentals
• Deep experience with Kubernetes and container orchestration at scale
• Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
• Strong automation and scripting skills (Python, Go, or Bash)
• Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
• Track record of operating production systems and leading incident response

🏖️ Benefits

• Ownership and autonomy to shape systems
• Opportunities to work directly with customers and providers

Similar Jobs

Backend/DevOps Engineer

🕒 February 25

Nick AI

1 - 10

💼 Consulting

📦 Logistics

🤖 Artificial Intelligence

Backend/DevOps Engineer managing deployments and infrastructure for an AI trading platform. Responsible for security, reliability, and scaling of systems across multiple venues.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

Docker

Google Cloud Platform

Grafana

Kubernetes

Prometheus

Python

Web3

Site Reliability Engineer

🕒 February 25

WorkOS

51 - 200

🔌 API

🏢 Enterprise

🤝 B2B

Site Reliability Engineer ensuring reliability and performance at WorkOS across complex systems. Leading incident response and collaborating with cross-functional teams for operational excellence.

🇺🇸 United States – Remote

💵 $175k - $275k / year

💰 $80M Series B - WorkOS on 2022-05

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Cloud

Grafana

Kubernetes

Prometheus

TypeScript

Network DevOps Engineer, RDMA Fabric Automation

🕒 February 25

Vultr

201 - 500

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

NetDevOps Engineer for RDMA Fabric Automation at Vultr. Automating and operating Ethernet fabrics with a focus on network performance.

🇺🇸 United States – Remote

💵 $90k - $130k / year

💰 $329M Debt Financing - Vultr on 2025-06

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

Grafana

Jenkins

Kafka

Linux

PHP

Prometheus

Python

Rust

DevOps Engineer – Mission-Critical Systems

🕒 February 25

Tactibit Technologies

11 - 50

💼 Consulting

📦 Logistics

🎖️ Defense

DevOps Engineer working at Tactibit Technologies to modernize legacy architectures for mission-critical systems. Collaborate with teams on cloud migrations and automating business processes.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

SRE – Platform Engineer

🕒 February 25

DroneUp

51 - 200

💼 Consulting

📦 Logistics

🏭 Manufacturing

SRE - Platform Engineer at DroneUp focusing on IT infrastructure reliability and scalability. Driving SRE best practices within the team and collaborating on cloud engineering solutions.

🇺🇸 United States – Remote

💵 $125k - $150k / year

💰 $241.2k Seed Round - DroneUp on 2022-07

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Azure

Cloud

Google Cloud Platform

Grafana

Kubernetes

Linux

MacOS

Node.js

Prometheus

Python

Terraform

Unix