Senior Site Reliability Engineer

🕒 Yesterday

HavocAI

11 - 50 employees

Founded 2024

🤖 Artificial Intelligence

🔐 Security

🔧 Hardware

💰 Seed Round on 2024-09

HavocAI is a developer of collaborative autonomy for maritime operations, offering a modular software and vehicle stack that enables fleets of autonomous maritime systems to perform contested logistics, sensor fusion and tracking, domain awareness, and escort-and-engage missions. Their product suite includes onboard autonomy (HAVOC OS), scalable communications (HAVOC CLOUD), and a handheld operator interface (HAVOC CONTROL), marketed as a single solution for theater-scaled security and rapid deployment. HavocAI emphasizes real-time, team-led autonomous solutions that run across diverse environments and supports both hardware (autonomous vessels) and software deployment.

📋 Description

• Design and evolve reliability architecture for distributed and cloud-hosted systems
• Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
• Partner with platform and application teams to design systems for reliability, scalability, and operability
• Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
• Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
• Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
• Conduct root cause analysis for complex production incidents and drive long-term corrective actions
• Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
• Reduce operational toil through tooling, automation, and process improvements
• Help build a culture of ownership, accountability, and continuous improvement across production systems
• Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
• Ensure services and data pipelines are observable, debuggable, and performant in production
• Drive performance analysis and tuning across infrastructure, application, and service layers
• Improve alert quality, reduce noise, and ensure operational signals are actionable
• Partner with engineering teams to define meaningful reliability and performance metrics
• Build automation to improve system reliability, deployment safety, and recovery processes
• Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
• Support and improve Kubernetes-based environments and containerized workloads
• Contribute to infrastructure-as-code practices and platform automation
• Help define operational standards for cloud infrastructure, deployment workflows, and production services
• Collaborate with security teams to ensure secure and resilient system design
• Participate in disaster recovery planning, backup strategy, and resilience testing
• Maintain strong operational practices around access control, secrets management, change management, and production access
• Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases

🎯 Requirements

• 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
• Strong experience operating large-scale distributed production systems
• Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
• Hands-on experience with Kubernetes and container orchestration
• Programming or scripting experience in Go, Python, or similar languages
• Experience designing and operating observability systems for production environments
• Proven ability to lead incident response and drive reliability improvements
• Strong communication skills and ability to collaborate across engineering teams
• Ability to operate calmly and effectively under pressure
• Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required

🏖️ Benefits

• 100% employer-paid Health, Dental, and Vision Insurance for you and your family
• Life Insurance (employer-paid)
• Ability to participate in the company's 401(k) program (with matching)
• Unlimited PTO policy with an enforced 2-week minimum
• Equity Package
• Work / Home Office Stipend
• Global Entry
• 16-Week Paid Parental Leave
• Monthly Health and Wellness Stipend
