Manager, Site Reliability Engineering

Aya Healthcare

5001 - 10000 employees

Founded 2001

⚕️ Healthcare Insurance

🎯 Recruiter

Healthcare Insurance • Recruitment

Aya Healthcare is a prominent provider of healthcare staffing services, connecting healthcare professionals such as nurses and allied health workers with healthcare facilities across the United States. They offer travel nursing, per diem, and permanent placement services, ensuring that hospitals and healthcare systems have the necessary staff to provide high-quality patient care. Aya Healthcare focuses on improving the staffing industry through innovative technology and exceptional customer service.

📋 Description

• Lead and grow the SRE team
• Drive reliability, performance, and availability
• Operational intelligence and AI-native operations
• Platform efficiency and stakeholder trust

🎯 Requirements

• 10+ years in a combination of Site Reliability Engineering, DevOps, Platform Engineering, or related production-operations roles.
• 4+ years of direct people management experience: hiring, performance management, career development, and running remote on-call teams.
• Demonstrated ownership of reliability outcomes for customer-facing SaaS at meaningful scale: defining and operationalizing SLOs/SLIs/error budgets and using them to drive engineering prioritization.
• Deep Azure experience: 3+ years operating production workloads on Azure, with hands-on depth in AKS, networking, identity, and platform services. Equivalent depth in AWS or GCP will be considered.
• Modern observability fluency: production-grade experience with Datadog (or equivalent: New Relic, Dynatrace, AppDynamics) across metrics, logs, traces, RUM, and synthetics.
• AI in operations: hands-on experience integrating AI/LLM-assisted tooling into operational workflows (incident summarization, runbook generation, log analysis, anomaly triage, change risk scoring).
• Incident command experience: proven ability to lead severity-1 incidents end-to-end, run blameless reviews, and convert lessons into systemic improvements.
• Regulated-environment instinct: operates with HIPAA, PHI, SOC 2, or comparable compliance constraints as a default mindset, not an afterthought.
• Executive-grade communication: translates reliability work into business outcomes for executive, product, and customer-facing audiences.
• Bachelor's degree in Computer Science, Information Technology, Engineering, or related field, or an equivalent combination of education, training, and experience.

🏖️ Benefits

• Free premium medical, dental, life, and vision insurance
• Generous 401(k) match
• Aya also offers other benefits to those who are eligible and where required by applicable law, including reimbursements and discretionary bonuses
• Aya provides paid sick leave in accordance with all applicable state, federal, and local laws. Aya’s general sick leave policy is that employees accrue one hour of paid sick leave for every 30 hours worked. However, to the extent that any provision of this policy conflicts with an applicable paid sick leave law, the applicable law controls
• Celebrations! We hit our goals and reward ourselves.
• Company-sponsored virtual events, happy hours, and team-building activities are always on the horizon. Plus, you get a special treat on your birthday!
• Unlimited DTO: we believe in time off!
• Virtual yoga, meditation, or boot camp classes offered daily
