Site Reliability Engineer – AI Infrastructure

Job not on LinkedIn

🕒 February 27

🏄 California – Remote

info

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

info
Apply Now
Find Similar Remote Jobs

📊 Check your resume score for this job

Improve your chances of getting an interview by checking your resume score before you apply.

Logo of Andromeda

Andromeda

11 - 50 employees

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

🔥 Funding within the last year

💰 $15.1M Series A - Andromeda Robotics on 2025-09

Artificial Intelligence • B2B • Hardware

Andromeda is a GPU compute service and marketplace offering instant access to large clusters of H100, H200, and B200 accelerators for experiments, full-scale training, and inference. It supports orchestration with Slurm, Kubernetes, or direct SSH, provides flexible, no-minimum-duration usage and competitive pricing, and includes DevOps expertise, local NAS or streamed storage with no ingress/egress fees, and 24/7 support with industry SLAs. The company also operates a third-party GPU marketplace at gpulist. ai.

📋 Description

• Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers • Build automation and tooling to streamline cluster deployments and integrations • Debug customer issues across networking, storage, scheduling, and system layers • Improve reliability and scalability of both training and inference infrastructure • Design and implement monitoring, alerting, and observability for critical systems • Collaborate with engineering and product teams to plan and deliver infrastructure for new services • Participate in on-call and incident response, leading postmortems and reliability improvements

🎯 Requirements

• 5+ years experience in SRE, DevOps, or infrastructure engineering roles • Strong Linux systems and networking fundamentals • Deep experience with Kubernetes and container orchestration at scale • Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.) • Strong automation and scripting skills (Python, Go, or Bash) • Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.) • Track record of operating production systems and leading incident response

🏖️ Benefits

• Ownership and autonomy to shape systems • Opportunities to work directly with customers and providers

Apply Now

Similar Jobs

🕒 February 26

Twilio

5001 - 10000

Reliability Architect at Twilio defining and leading solutions for reliable products. Collaborating with teams to ensure operational excellence and scalability in high-scale systems design.

🕒 February 26

Knox Systems, Inc.

201 - 500

🏛️ Government

🔒 Cybersecurity

📋 Compliance

Devops Security Engineer at Knox securing cloud-native environments for U.S. government missions. Focus on preventative security, automation, and continuous compliance within FedRAMP frameworks.

🇺🇸 United States – Remote

💵 $110k - $140k / year

🔥 Funding within the last year

💰 $6.5M Seed on 2025-08

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🕒 February 26

JFrog

1001 - 5000

🏢 Enterprise

☁️ SaaS

🔐 Security

Senior Professional Services DevOps Engineer designing CI/CD pipelines at JFrog. Collaborating with clients and teams to enhance DevOps experience.

🕒 February 25

Nick AI

1 - 10

🤖 Artificial Intelligence

₿ Crypto

☁️ SaaS

Backend/DevOps Engineer managing deployments and infrastructure for AI trading platform. Responsible for security, reliability, and scaling of systems across multiple venues.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🕒 February 25

WorkOS

51 - 200

🔌 API

🏢 Enterprise

🤝 B2B

Site Reliability Engineer ensuring reliability and performance at WorkOS across complex systems. Leading incident response and collaborating with cross-functional teams for operational excellence.

🇺🇸 United States – Remote

💵 $175k - $275k / year

💰 $80M Series B - WorkOS on 2022-05

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

info