Lead Site Reliability Developer – CSRE Consulting

🕒 May 1

Live Nation Entertainment

10,000+ employees

Founded 1996

📱 Media

💰 Post-IPO Debt on 2023-01

Media • Entertainment

Live Nation Entertainment is the global leader in live entertainment, powering unforgettable experiences around the world. Artist-powered and fan-driven, Live Nation works with musicians to bring their creativity to life on stages across the globe. As the top producer of concerts, ticket seller, and brand connector to music, Live Nation's platform leads the market in these three core industries. Their mission extends beyond entertainment, aiming to uplift, inspire, and create memories through the power of live music.

📋 Description

• Lead consulting work from discovery through delivery by aligning stakeholders on priorities, sequencing work, and communicating measurable outcomes.
• Establish working cadence and facilitate decision forums to surface risks, map dependencies, and drive clear ownership and timelines.
• Align product, platform, and engineering stakeholders on reliability targets and trade-offs using SLOs and error budgets.
• Partner regularly with Engineering Managers, product managers, Staff and Principal engineers, and platform leads to keep dependencies, decisions, and delivery aligned.
• Identify systemic risks across shared dependencies and coordinate remediation across multiple teams to reduce recurring incidents.
• Drive change adoption by embedding reliability mechanisms into partner team routines such as planning, PRRs, and on-call practices.
• Design and implement reusable reliability mechanisms, templates, and tooling that can be adopted across teams.
• Establish and evolve production readiness review practices with partner teams to improve launch quality and change safety.
• Drive observability strategy for partner domains by improving signal quality, alerting philosophy, and operational dashboards.
• Lead complex incident investigations and ensure learnings translate into durable fixes with clear owners and verification.
• Lead reliability-focused design and code reviews and guide teams toward simpler, safer architectures.
• Mentor Senior engineers and other consultants through pairing, reviews, and structured coaching to multiply impact.
• Partner with internal platform engineering to influence roadmaps and deliver shared capabilities that accelerate SRE adoption.
• Improve CSRE Consulting playbooks and operating practices based on repeated patterns observed across teams.
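
For readers less familiar with the SLO and error budget language used in this description, the arithmetic behind an error budget can be sketched briefly. This is a minimal illustration only: the 99.9% availability target, the 30-day window, and the function names are assumptions chosen for the example and do not come from this posting or from Live Nation's actual practices.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo_target is a fraction strictly between 0 and 1 (e.g. 0.999).
    Both parameters are hypothetical example inputs.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining_fraction(
    slo_target: float, window_days: int, downtime_minutes: float
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Assumed example: 99.9% availability over a 30-day window.
    budget = error_budget_minutes(0.999, 30)
    remaining = budget_remaining_fraction(0.999, 30, downtime_minutes=10.8)
    print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

Under those assumed numbers the budget is roughly 43 minutes per window, and a team that has already spent about a quarter of it might reasonably weigh further risky launches against reliability work, which is the kind of trade-off conversation the role describes.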

🎯 Requirements

• Deep practical understanding of SRE principles, including SLO governance and error budget policy in practice
• Proven ability to lead cross-team technical work and influence without authority
• Strong experience designing and troubleshooting distributed systems with cross-service failure modes
• Experience shaping observability and alerting strategy and improving operational signal quality
• Strong Kubernetes and AWS experience, including governance and cost trade-offs
• Ability to design reliability automation and tooling that is reusable and adopted by multiple teams
• Experience leading production readiness and resilience practices, including DR validation and controlled testing
• Strong software engineering fundamentals with the ability to deliver and review high-quality changes in enterprise codebases
• Advanced incident analysis skills focused on systemic risk reduction and organizational learning
• Excellent communication skills, including exec-ready summaries and clear technical diagrams

🏖️ Benefits

• Generous vacation
• Healthcare
• Retirement benefits
• Student loan repayment
• Tuition reimbursement
• Six months of paid caregiver leave for new parents including fostering
• Access to free live events through our exclusive employee ticketing program
