Staff Site Reliability Engineer

201 - 500 employees

Founded 2015

🔒 Cybersecurity

☁️ SaaS

🏛️ Government

🔥 Funding within the last year

💰 $39M Venture Round - SimSpace on 2025-10

Cybersecurity • SaaS • Government

SimSpace is a cybersecurity company that provides a realistic, intelligent cyber range platform for training, testing, and validating security teams, tools, and AI agents. Its platform enables live-fire exercises, threat emulation (full kill-chain and atomic), validation of controls and agentic workflows, and disaster recovery and compliance testing; it is used by enterprises and government customers to build cyber readiness and resilience. Founded by experts from U.S. Cyber Command and MIT Lincoln Laboratory, SimSpace focuses on upskilling individuals, strengthening teams, and evaluating AI-driven defenses in realistic, production-like simulations.

🕒 4 days ago

🇺🇸 United States – Remote

💵 $165k - $230k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Distributed Systems

Grafana

Kubernetes

Python

VMware

📋 Description

• Design and architect the overarching infrastructure strategy that enables consistent, repeatable, and secure deployments across SimSpace-hosted data centers, customer-provided hardware, and highly restricted air-gapped environments.
• Lead the evolution of our CI/CD and Kubernetes platforms. Drive advanced application packaging, templating, and configuration management strategies using Jsonnet and Grafana Tanka (alongside Kustomize). Move beyond maintaining pipelines to architecting multi-cluster, multi-environment deployment frameworks that drastically improve developer velocity.
• Define, measure, and govern Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across the engineering organization. Partner with product and engineering leadership to balance feature delivery with platform stability.
• Architect our enterprise observability strategy using the Grafana stack. Design frameworks for proactive monitoring, complex anomaly detection, and distributed tracing that give teams unparalleled visibility into system health, pod scaling, and latency bottlenecks.
• Drive the infrastructure security posture at an architectural level. Embed advanced container security, zero-trust network segmentation, and automated compliance policies directly into our deployment pipelines and runtime environments.
• Serve as a strategic partner and consultant to development teams. Advocate for an "SRE culture" by designing self-service tooling, establishing "paved roads" for developers, and reducing operational toil across the entire engineering org.
• Act as an Incident Commander during complex, high-severity outages. Drive blameless post-mortems and engineer long-term, systemic, and architectural fixes to ensure classes of failures never repeat.
• Act as a technical mentor to senior and mid-level engineers. Raise the baseline of engineering excellence across the company by coaching, documenting best practices, and leading by example.

🎯 Requirements

• 8+ years of experience in Site Reliability, Platform, or DevOps engineering, with a proven track record of operating at a Staff, Principal, or Lead level to drive organization-wide infrastructure initiatives.
• You possess deep software engineering skills (beyond scripting) and can architect complex, production-quality systems. You design clean interfaces, build maintainable tooling, and can dictate the technical direction of our internal toolchain. Language agnostic, but highly proficient in at least one modern language (e.g., Go, Python).
• Deep, architectural understanding of Kubernetes in multi-tenant and multi-cluster production environments. You possess expert-level knowledge of Jsonnet and Grafana Tanka for managing complex, scalable Kubernetes configurations and application packaging.
• Extensive experience architecting sophisticated CI/CD pipelines and GitOps workflows using GitHub Actions, ArgoCD, and infrastructure-as-code principles at an enterprise scale.
• Systems-level thinking with the ability to design architectures that span self-hosted, on-premises, VMware-based, and air-gapped deployment models.
• Deep expertise with observability platforms (Grafana stack preferred) and a proven ability to design alerting and monitoring strategies for complex distributed systems.
• Strong background in infrastructure security architecture, including container hardening, network security, vulnerability management, and delivering software to heavily regulated or customer-managed environments.
• Exceptional communication and stakeholder management skills. You have a service-oriented mindset, but you also have the ability to influence cross-functional leadership, negotiate reliability tradeoffs, and align engineering teams behind a unified technical vision.

🏖️ Benefits

• Comprehensive medical, dental, and vision benefits, plus savings plans; coverage starts on day one!
• Access to company-paid counseling, coaching, and resources for you and your family through Spring Health.
• Plan for your future with a 401(k) retirement savings plan featuring a company match.
• Take the time you need with unlimited vacation and dedicated health & wellness days. SimSpace provides flexible solutions to meet the diverse work-life needs of team members.
• Paid leave plans to support you and your loved ones during life’s most important moments.
• Equity stock options at hire, with annual performance-based grants, so you become an invested stakeholder in our shared success.
• Earn $1,500–$3,500 for every qualified hire through our employee referral program.
• Fully and partially subsidized membership plans and equipment discounts to help you reach your personalized fitness goals.
• Access a LinkedIn Learning membership to prioritize your personal and professional development.
• Monthly reimbursements for meaningful connections with teammates through our SocialSpace Community.
• Legal plan coverage, pet insurance, wellness reimbursements, and more to simplify life’s details.

Similar Jobs

DevOps Engineer

🕒 4 days ago

SES Corporation

51 - 200

🏢 Enterprise

🏛️ Government

☁️ SaaS

DevOps Engineer supporting U.S. Air Force Cloud One Architecture and Shared Services contract. Involved in managing multi-cloud environments for system resiliency and security.

🇺🇸 United States – Remote

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

AWS

Azure

Cloud

Cyber Security

Grafana

Jenkins

Oracle

Prometheus

Terraform

Director, Infrastructure & Site Reliability Engineering

🕒 5 days ago

Alteryx

1001 - 5000

🤖 Artificial Intelligence

🤝 B2B

Director of Infrastructure & Site Reliability Engineering at Alteryx overseeing cloud infrastructure and ensuring operational excellence. Leading multiple teams to enhance platform efficiency and reliability.

🇺🇸 United States – Remote

💵 $181.9k - $239.6k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Cloud

Distributed Systems

Google Cloud Platform

Grafana

Kubernetes

Prometheus

Terraform

DevOps Engineer

🕒 5 days ago

Leidos

10,000+ employees

🔒 Cybersecurity

🔬 Science

OCI DevOps Engineer supporting Leidos' DHMSM IDIQ contract with cloud infrastructure and management. Focused on automation and optimization of infrastructure in a secure, global environment.

🇺🇸 United States – Remote

💵 $107.9k - $195.1k / year

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Ansible

AWS

Azure

Cloud

Docker

Google Cloud Platform

Jenkins

Kubernetes

Linux

Python

Ruby

Terraform

Customer Reliability Engineer

🕒 July 11

Cisco

10,000+ employees

🔧 Hardware

🔐 Security

🏢 Enterprise

Customer Reliability Engineer managing complex escalations for Cisco Hypershield in enterprise environments. Collaborating with engineering to improve product reliability and customer satisfaction.

🇺🇸 United States – Remote

💵 $158.2k - $200.7k / year

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Kubernetes

Linux

Switching

Gen AI Security, DevSecOps Engineer

🕒 July 11

NBA

11 - 50

🏠 Real Estate

🤝 B2B

Gen AI Security & DevSecOps Engineer ensuring NBA's security infrastructure and AI adoption. Responsible for security across CI/CD pipelines, generative AI, and cloud.

🇺🇸 United States – Remote

💵 $145k - $165k / year

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Cloud

Kubernetes

SDLC