Senior Site Reliability Engineer

🔥 0 minutes ago


Akuity

11 - 50 employees

Founded 2021

🏢 Enterprise

☁️ SaaS

Enterprise • SaaS • Cloud

Akuity provides an end-to-end GitOps platform for Kubernetes covering deployment, promotion, and monitoring, built around Argo CD, Kargo, and KubeVision. The platform extends Kubernetes APIs to support continuous delivery, container orchestration, and event automation. Its products aim to simplify infrastructure management, improve security, and make deployments more efficient, which the company says saves developers time and increases shipping velocity. Akuity also offers enterprise support for Argo, with an emphasis on security, compliance, and scalability for Kubernetes deployments.

📋 Description

• Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
• Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
• Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
• Partner with engineering teams to build reliability into new features before they ship to production
• Participate in an on-call rotation and act as incident commander for high-severity production events
• Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
• Drive improvements to alerting fidelity: reduce noise, increase signal, eliminate toil
• Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items

🎯 Requirements

• 5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
• Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything
• Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM
• Experience defining and operating against SLOs in production; you've written error budgets, not just read about them
• Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent)
• Solid scripting and automation skills in Go, Python, Bash, or similar; you automate what you touch
• Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems
• Live within US time zones (Pacific through Eastern), including Canada and other regions

🏖️ Benefits

• Competitive compensation, commensurate with experience
• Equity participation in a well-funded, growing company
• Fully remote: work from anywhere within US time zones (Pacific through Eastern), including Canada and other regions
• Home office stipend and equipment budget
• Flexible time off and a culture that respects it
• Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here
• US-based employees receive full benefits, including comprehensive health, dental, and vision coverage. Candidates based outside the US will be engaged as contractors.
