Site Reliability Engineer

Hardware • Enterprise • Artificial Intelligence

Hydra Host provides high-performance computing infrastructure, offering dedicated bare-metal GPU servers optimized for AI and HPC workloads. Its platform lets users rent top-tier GPUs globally, with an emphasis on performance, security, and customization. Hydra Host's infrastructure includes a marketplace, Brokkr, that offers a wide range of GPU configurations for mission-critical applications such as AI, big data, and machine learning. Customers retain full control over their server environments, with options to scale as their needs grow. The company's offerings are trusted by leading firms seeking efficient and innovative computing solutions.

11 - 50 employees

🔧 Hardware

🏢 Enterprise

🤖 Artificial Intelligence

💰 $10M Seed Round on 2022-04

October 29

🐊 Florida – Remote

💵 $140k - $200k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

Grafana

Kubernetes

Prometheus

Python

📋 Description

• Design, deploy, and maintain QA systems used by our development teams to test integration and live system responses across full-stack deployments in local, live, and ephemeral environments
• Evaluate monitoring and QA tools, and integrate the ones that fit the job
• Create a unified monitoring platform and processes that datacenter and device teams will integrate with to monitor their components (live servers, lifecycle, networks, power, etc.)
• Maintain monitoring processes and dashboards to provide complete visibility into the health, performance, and reliability of our CI systems, software deployments, and testing platforms
• Create and maintain a systems test suite, in collaboration with our product managers, to validate marketplace changes against all business functions in live and ephemeral QA environments
• Integrate all of the aforementioned systems to produce holistic platform health reporting
• Design disaster-recovery processes in collaboration with DevOps
• Ensure we are meeting uptime SLAs across all platform deployments
• Work with datacenter and device teams to define service-level indicators (SLIs), service-level objectives (SLOs), and SLAs
• Establish observability standards across the stack: logs, metrics, traces, alerts, and actionable on-call playbooks
• Automate everything from monitoring setups to incident responses to eliminate manual toil and increase reliability
• Drive incident response, root cause analysis, and post-mortems
• Translate incident outcomes into tooling and process improvements
• Establish the monitoring infrastructure and dashboards that enable everyone, from engineers to execs, to know what's going on
• Act as the reliability partner to engineering teams: review systems for reliability concerns, help design QA requirements and testing, and help teams meet reliability targets

🎯 Requirements

• 5–8+ years of experience in Reliability Engineering, DevOps, or infrastructure roles focused on large-scale, high-uptime production environments
• Deep familiarity with monitoring and observability tooling: you've implemented and managed such systems, especially Prometheus, Grafana, and Zabbix
• Strong experience with service orchestration in multi-region environments (Nomad, Kubernetes, cloud VMs, distributed databases)
• Track record of managing production system uptime and SLAs and building tools to support them
• Experience writing and reviewing post-mortems and using those findings to drive improvements in tools and process
• Proficient with scripting and programming languages (Python, Go, Bash, etc.) for automating operational tasks
• Strong proficiency with infrastructure as code and DevOps workflows
• Experience with distributed tracing, log aggregation, and alert tuning
• Passion for building systems that fail gracefully, alert correctly, and empower others to operate confidently
• Excellent communication skills: you can write clear documentation, drive incident reviews, and communicate reliability risks to technical and non-technical stakeholders

🏖️ Benefits

• Competitive compensation: base salary + performance bonus + equity
• Exposure to high-performance computing and state-of-the-art GPU environments
• A core role in ensuring our systems are reliable, observable, and meet customer SLAs
• Remote work environment with a strong culture of ownership and autonomy
• No red tape: find the right solution, work with the team, get feedback, and get the job done
