Senior Site Reliability Engineer

October 29

🏄 California – Remote

⛰️ Colorado – Remote

+5 more states

💵 $165k - $175k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

GovX

Marketplace • B2C

GovX is an online platform offering exclusive discounts for current and former military personnel, first responders, and law enforcement. Members can access special deals on a wide range of products, events, tickets, travel offers, and participating brands. The marketplace is designed to provide savings to those who serve, supporting them with benefits and an easy-to-use shopping experience. By partnering with various brands, GovX extends significant discounts as a token of appreciation for the services these individuals provide.

51 - 200 employees

📋 Description

• Maintain scalable, secure, and reliable cloud services ensuring reliable system operations within Service Level Objectives.
• Implement and manage monitoring, alerting, and observability systems using Prometheus, Grafana, and Azure Monitor to proactively identify and resolve issues.
• Develop and maintain automation scripts and tools in PowerShell, Bash, and C# to improve deployment efficiency, system reliability, and developer productivity.
• Create, refine, and maintain detailed runbooks for production systems to ensure consistent operational procedures and effective incident response.
• Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to measure and maintain system reliability.
• Collaborate with software engineers and automation engineers to integrate reliability practices into CI/CD pipelines using Azure DevOps.
• Design and implement intelligent alerting strategies that ensure high signal-to-noise ratios and enable rapid triage of critical issues.
• Participate in incident response, post-incident reviews, and blameless root cause analysis to drive continuous improvement of system reliability and uptime.
• Contribute to deployment strategy evolution, including blue-green and canary deployments, to minimize downtime and release risk.
• Collaborate closely with Automation Engineers to enhance automated validation and testing of production environments.
• Monitor system health, capacity, and performance, providing data-driven insights and recommendations for optimization.
• Conduct chaos engineering experiments and resilience testing to proactively identify and address system weaknesses.
• Develop and maintain disaster recovery and business continuity plans, including regular failover testing.
• Participate in the on-call rotation for platform services, ensuring high availability and rapid incident resolution.
• Proactively monitor and respond to production support tickets and alerts within established SLA timeframes, delivering first-level diagnosis, troubleshooting, and escalation as needed to maintain system reliability.
• Continuously improve incident response playbooks and reduce Mean Time to Recovery (MTTR).
• Participate in sprint planning, stand-ups, and retrospectives to ensure alignment with development and operational objectives.
• Identify opportunities to improve resiliency, reduce toil, and strengthen the reliability culture across the engineering organization.
• Collaborate with security and compliance teams to ensure infrastructure meets regulatory and security standards.
• Support cost optimization efforts by monitoring cloud resource usage and recommending efficiency improvements.
• Explore and integrate AI/ML-based observability tools for predictive monitoring and anomaly detection.

🎯 Requirements

• 8+ years of professional experience in site reliability, infrastructure, or systems engineering roles.
• Proficiency with Azure cloud infrastructure, services, and resource management.
• Experience with operating systems, network concepts, protocols, and architecture, including Microsoft and Linux operating systems, Active Directory, and the OSI model.
• Technical ability in Node.js and .NET/C#, and knowledge of both current and legacy architecture, software development practices, and conventions.
• Strong experience with REST APIs.
• Hands-on experience with containerization and orchestration using Kubernetes and microservices architecture.
• Strong automation and scripting skills in PowerShell and Bash.
• Experience with Infrastructure as Code tools for provisioning and configuration management.
• Deep understanding of CI/CD processes and tools, preferably using Azure DevOps.
• Experience implementing and managing observability solutions, including Azure Monitor, Application Insights, Log Analytics workspaces, Prometheus, and Grafana.
• Strong problem-solving, analytical, and troubleshooting abilities in distributed systems and cloud environments.
• Ability to write, maintain, and execute operational runbooks and automation for incident management and recovery.
• Ability to work in a self-directed manner and to plan and execute projects involving multiple technical resources and stakeholders.
• Excellent communication and collaboration skills, with the ability to work across software development, infrastructure, and operations teams.

🏖️ Benefits

• Paid Time Off, Paid Sick Leave, Paid Holidays
• Competitive Medical, Dental, Vision, and Life Insurance
• 401(k) plan with discretionary match available
• Flexible Spending Account (FSA), Health Savings Account (HSA)
• Voluntary benefits including Critical Illness, Group Accident, and Voluntary Life
• Employee Referral Program
• Exposure to a growing ecommerce company
• Discounts on the GOVX website
