Senior Site Reliability Engineer

🔥 0 minutes ago

HavocAI

11 - 50 employees

Founded 2024

🤖 Artificial Intelligence

🔐 Security

🔧 Hardware

💰 Seed Round on 2024-09

HavocAI is a developer of collaborative autonomy for maritime operations, offering a modular software and vehicle stack that enables fleets of autonomous maritime systems to perform contested logistics, sensor fusion and tracking, domain awareness, and escort-and-engage missions. Their product suite includes onboard autonomy (HAVOC OS), scalable communications (HAVOC CLOUD), and a handheld operator interface (HAVOC CONTROL), marketed as a single solution for theater-scaled security and rapid deployment. HavocAI emphasizes real-time, team-led autonomous solutions that run across diverse environments and supports both hardware (autonomous vessels) and software deployment.

📋 Description

• Design and evolve reliability architecture for distributed and cloud-hosted systems
• Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
• Partner with platform and application teams to design systems for reliability, scalability, and operability
• Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
• Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
• Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
• Conduct root cause analysis for complex production incidents and drive long-term corrective actions
• Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
• Reduce operational toil through tooling, automation, and process improvements
• Help build a culture of ownership, accountability, and continuous improvement across production systems
• Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
• Ensure services and data pipelines are observable, debuggable, and performant in production
• Drive performance analysis and tuning across infrastructure, application, and service layers
• Improve alert quality, reduce noise, and ensure operational signals are actionable
• Partner with engineering teams to define meaningful reliability and performance metrics
• Build automation to improve system reliability, deployment safety, and recovery processes
• Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
• Support and improve Kubernetes-based environments and containerized workloads
• Contribute to infrastructure-as-code practices and platform automation
• Help define operational standards for cloud infrastructure, deployment workflows, and production services
• Collaborate with security teams to ensure secure and resilient system design
• Participate in disaster recovery planning, backup strategy, and resilience testing
• Maintain strong operational practices around access control, secrets management, change management, and production access
• Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases

🎯 Requirements

• 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
• Strong experience operating large-scale distributed production systems
• Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
• Hands-on experience with Kubernetes and container orchestration
• Programming or scripting experience in Go, Python, or similar languages
• Experience designing and operating observability systems for production environments
• Proven ability to lead incident response and drive reliability improvements
• Strong communication skills and ability to collaborate across engineering teams
• Ability to operate calmly and effectively under pressure
• Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required

🏖️ Benefits

• 100% Employer-Paid Health, Dental, and Vision Insurance for you and your family
• Life Insurance (Employer Paid)
• Ability to participate in the company's 401k program (Matching)
• Unlimited PTO policy with an enforced 2-week minimum
• Equity Package
• Work / Home Office Stipend
• Global Entry
• 16-Week Paid Parental Leave
• Monthly Health and Wellness Stipend
