Principal Site Reliability Engineer

🔥 25 minutes ago

DraftKings Inc.

1001 - 5000 employees

Founded 2012

🎲 Gambling

🎮 Gaming

👥 B2C

DraftKings Inc. is a digital sports entertainment company operating a leading online sportsbook, daily fantasy sports, and casino platform that delivers real-money betting and gaming experiences via web and mobile apps. It combines sports data, analytics, and content to engage fans, provides marketing and VIP/loyalty programs, and maintains global teams across engineering, product, compliance, and customer experience while emphasizing responsible gaming.

📋 Description

• Define and execute the long-term strategy for our Kubernetes platform across Google Kubernetes Engine, Amazon Elastic Kubernetes Service, RKE2, and on-premise environments, ensuring reliability, scalability, and operational consistency.
• Drive architectural decisions across critical infrastructure, including cluster lifecycle management, networking, identity and access management, observability, autoscaling, capacity planning, and cost optimization.
• Lead large-scale platform initiatives across multiple engineering teams, establishing technical direction, engineering standards, and measurable outcomes that improve platform reliability and developer experience.
• Establish and evolve reliability practices by defining service level objectives, service level indicators, and error budget frameworks that align platform performance with business priorities.
• Build automation-first infrastructure through Infrastructure as Code, GitOps workflows, self-healing systems, and internal platform tooling that improve engineering velocity and reduce operational overhead.
• Champion the responsible adoption of AI-powered engineering capabilities that improve operational efficiency, accelerate incident response, and enhance developer productivity.
• Lead critical platform incidents, drive post-incident improvements, and strengthen platform resilience through automation, capacity planning, and operational excellence.
• Mentor senior engineers, influence technical strategy across the organization, and elevate engineering excellence through architecture reviews, coaching, and technical leadership.

🎯 Requirements

• A Bachelor's Degree in Computer Science or a related technical field.
• At least 8 years of experience designing, operating, and scaling distributed cloud and on-premise infrastructure, including at least 3 years operating at the Staff, Principal, or equivalent technical leadership level.
• Proven experience leading large-scale infrastructure or platform initiatives that require cross-functional alignment and long-term technical ownership.
• Deep expertise with Kubernetes, including cluster architecture, networking, storage, security, operators, lifecycle management, and large-scale production operations.
• Extensive experience building and operating production infrastructure in AWS and Google Cloud Platform using Infrastructure as Code technologies such as Terraform, Pulumi, or similar tools.
• Strong software development experience in Go, Python, or both, with expertise in GitOps, continuous integration and continuous delivery, observability, distributed systems, Linux, and reliability engineering principles.
• Experience incorporating AI-powered tools into engineering workflows while applying sound judgment around reliability, security, and operational risk.
• Exceptional communication and leadership skills with a proven ability to mentor engineers, influence technical strategy, and drive engineering excellence.
• Experience working in regulated industries, hybrid cloud environments, contributing to open-source projects, or holding cloud certifications is preferred.

🏖️ Benefits

• Bonus
• Equity
• Benefits as applicable

Similar Jobs

🔥 2 hours ago

DraftKings Inc.

1001 - 5000

🎮 Gaming

⚽ Sports

👥 B2C

Principal Site Reliability Engineer shaping the strategy for Kubernetes platform and driving architectural decisions. Leading platform initiatives at DraftKings with a focus on reliability and automation.

🔥 6 hours ago

Convoso

201 - 500

🤝 B2B

Director of DevOps leading a team of engineers at Convoso, an AI-powered contact center platform. Responsible for developing and optimizing the platform and ensuring service reliability.

🔥 11 hours ago

FluidStack

11 - 50

🤖 Artificial Intelligence

Principal Operations Engineer overseeing critical operations in data centers for Fluidstack. Leading on-call escalation, root cause analysis, and operational excellence in real-time situations.

🕒 2 days ago

General Dynamics Information Technology

10,000+ employees

🔒 Cybersecurity

🤖 Artificial Intelligence

Site Reliability Engineer blending software engineering, automation, and operations expertise. Building scalable platforms and enabling high-velocity delivery for critical defense systems.

🕒 4 days ago

Coinbase

1001 - 5000

₿ Crypto

💸 Finance

💳 Fintech

Staff Software Engineer responsible for enhancing reliability and security in production environments. Collaborating on projects to scale systems at Coinbase.