Senior Site Reliability Engineer

Marketplace • B2C

GovX is an online platform offering exclusive discounts for current and former military personnel, first responders, and law enforcement. Members can access special deals on a wide range of products, events, tickets, and travel offers from participating brands. The marketplace is designed to provide savings to those who serve, supporting them with benefits and an easy-to-use shopping experience. By partnering with various brands, GovX extends significant discounts as a token of appreciation for the service these individuals provide.

51 - 200 employees

🏪 Marketplace

👥 B2C

October 29

🏄 California – Remote

⛰️ Colorado – Remote

+5 more states

💵 $165k - $175k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Azure

Cloud

Distributed Systems

Grafana

JavaScript

Kubernetes

Linux

Microservices

Node.js

Prometheus

.NET

📋 Description

• Maintain scalable, secure, and reliable cloud services that operate within Service Level Objectives.
• Implement and manage monitoring, alerting, and observability systems using Prometheus, Grafana, and Azure Monitor to proactively identify and resolve issues.
• Develop and maintain automation scripts and tools in PowerShell, Bash, and C# to improve deployment efficiency, system reliability, and developer productivity.
• Create, refine, and maintain detailed runbooks for production systems to ensure consistent operational procedures and effective incident response.
• Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to measure and maintain system reliability.
• Collaborate with software engineers and automation engineers to integrate reliability practices into CI/CD pipelines using Azure DevOps.
• Design and implement intelligent alerting strategies that ensure high signal-to-noise ratios and enable rapid triage of critical issues.
• Participate in incident response, post-incident reviews, and blameless root cause analysis to drive continuous improvement of system reliability and uptime.
• Contribute to deployment strategy evolution, including blue-green and canary deployments, to minimize downtime and release risk.
• Collaborate closely with automation engineers to enhance automated validation and testing of production environments.
• Monitor system health, capacity, and performance, providing data-driven insights and recommendations for optimization.
• Conduct chaos engineering experiments and resilience testing to proactively identify and address system weaknesses.
• Develop and maintain disaster recovery and business continuity plans, including regular failover testing.
• Participate in the on-call rotation for platform services, ensuring high availability and rapid incident resolution.
• Proactively monitor and respond to production support tickets and alerts within established SLA timeframes, delivering first-level diagnosis, troubleshooting, and escalation as needed to maintain system reliability.
• Continuously improve incident response playbooks and reduce Mean Time to Recovery (MTTR).
• Participate in sprint planning, stand-ups, and retrospectives to ensure alignment with development and operational objectives.
• Identify opportunities to improve resiliency, reduce toil, and strengthen the reliability culture across the engineering organization.
• Collaborate with security and compliance teams to ensure infrastructure meets regulatory and security standards.
• Support cost optimization efforts by monitoring cloud resource usage and recommending efficiency improvements.
• Explore and integrate AI/ML-based observability tools for predictive monitoring and anomaly detection.

🎯 Requirements

• 8+ years of professional experience in site reliability, infrastructure, or systems engineering roles.
• Proficiency with Azure cloud infrastructure, services, and resource management.
• Experience with operating systems (Microsoft and Linux), Active Directory, and network concepts, protocols, and architecture, including the OSI model.
• Technical ability in Node.js and .NET/C#, and knowledge of both current and legacy architecture, software development practices, and conventions.
• Strong experience with REST APIs.
• Hands-on experience with containerization and orchestration using Kubernetes and microservices architecture.
• Strong automation and scripting skills in PowerShell and Bash.
• Experience with Infrastructure as Code tools for provisioning and configuration management.
• Deep understanding of CI/CD processes and tools, preferably using Azure DevOps.
• Experience implementing and managing observability solutions, including Azure Monitor, Application Insights, Log Analytics Workspaces, Prometheus, and Grafana.
• Strong problem-solving, analytical, and troubleshooting abilities in distributed systems and cloud environments.
• Ability to write, maintain, and execute operational runbooks and automation for incident management and recovery.
• Ability to work in a self-directed manner and to plan and execute projects involving multiple technical resources and stakeholders.
• Excellent communication and collaboration skills, with the ability to work across software development, infrastructure, and operations teams.

🏖️ Benefits

• Paid Time Off, Paid Sick Leave, Paid Holidays
• Competitive Medical, Dental, Vision, and Life Insurance
• 401(k) plan with discretionary match available
• Flexible Spending Account (FSA), Health Savings Account (HSA)
• Voluntary benefits including Critical Illness, Group Accident, and Voluntary Life
• Employee Referral Program
• Exposure to a growing ecommerce company
• Discounts on the GovX website
