Staff Site Reliability Engineer – Volcano

🔥 23 hours ago

Kong Inc.

201 - 500 employees

Founded 2017

🔌 API

☁️ SaaS

🏢 Enterprise

💰 $100M Series D on 2021-02

Kong Inc. is a company that provides a comprehensive API platform designed to facilitate API management, AI integration, and developer productivity. It offers solutions like Kong Gateway, Kong Konnect, and a variety of other tools targeted at managing and optimizing the API lifecycle. Kong's platform supports multi-cloud environments and is built to deliver high performance and security. It is notably recognized by Gartner as a leader in API management and supports innovations across industries like financial services, healthcare, and technology. The company emphasizes flexibility, security, and speed, making it a favored choice for enterprises looking to enhance their digital services through APIs. Kong also supports a robust community of developers and provides extensive integrations and plugins to streamline API management and operations.

📋 Description

• Own reliability for Volcano end-to-end: Define and drive SLOs, error budgets, and incident response practices for all Volcano services (edge deployments, managed Postgres, auth, realtime, storage, and the control plane).
• Architect the platform's infrastructure: Design and build the multi-region Kubernetes infrastructure, networking, and data plane that powers Volcano's edge deployment pipeline and backend-as-a-service capabilities.
• Build the GitOps and CI/CD backbone: Establish deployment automation, canary pipelines, and preview environment provisioning using ArgoCD, Helm, and Terraform/Terragrunt, setting patterns the broader team will follow.
• Scale managed data services: Design, operate, and harden multi-tenant PostgreSQL clusters, Redis caching layers, and object storage, with a focus on data isolation, performance, and disaster recovery.
• Drive observability from day one: Instrument every Volcano service with meaningful SLIs; build dashboards, alerts, and runbooks using Datadog, Prometheus, and Grafana before services go live, not after incidents.
• Lead cross-functional reliability work: Collaborate with the OCTO team, product engineering, and security to bake reliability and compliance into Volcano's architecture, not bolt it on later.
• Set SRE culture and standards: Mentor engineers across Volcano's contributing teams on reliability principles; lead postmortems, define on-call practices, and build a blameless engineering culture.
• Evaluate and adopt emerging technologies: Given Volcano's greenfield nature, evaluate and make architectural decisions on edge runtimes, serverless compute, vector databases, and AI-native infrastructure components.

🎯 Requirements

• BS in Computer Science or equivalent; substantial experience at Staff or Principal IC level in SRE/Platform Engineering.
• Proven track record building SRE or platform engineering practices for developer-facing platforms or PaaS/SaaS products, ideally at greenfield stage.
• Deep Kubernetes expertise: multi-tenant cluster design, networking (CNI, service mesh, ingress), autoscaling, and security hardening.

🏖️ Benefits

• Healthcare benefits
• 401(k) plan
• Short- and long-term disability benefits
• Basic life and AD&D insurance

Similar Jobs

🕒 3 days ago

Gorilla Logic

501 - 1000

☁️ SaaS

🏢 Enterprise

🤖 Artificial Intelligence

Technical Engineering Manager leading high-performing cloud and DevOps teams. Guiding architecture and delivery of scalable, reliable, and secure cloud solutions for clients.

AWS

Azure

Cloud

Distributed Systems

Google Cloud Platform

Microservices

🕒 4 days ago

General Dynamics Information Technology

10,000+ employees

🔒 Cybersecurity

🤖 Artificial Intelligence

DevSecOps Engineer developing and operating security automation platforms for Department of Defense and Federal customers. Focus on hands-on software development within a DevSecOps context.

Ansible

Docker

Kubernetes

Linux

Terraform

🕒 6 days ago

ClassWallet

11 - 50

💳 Fintech

📚 Education

🏛️ Government

DevOps Engineer optimizing cloud infrastructure and deployment pipelines for fintech company. Redefining public funds management and ensuring system reliability with high compliance standards.

AWS

Cloud

Docker

EC2

Grafana

Kubernetes

Node.js

Prometheus

Terraform

🕒 6 days ago

Domino Data Lab

201 - 500

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Staff Site Reliability Engineer working on AI-assisted reliability tooling at Domino Data Lab. Leading incident response and enhancing system observability for critical services.

Cloud

Kubernetes

Linux

Python

Go

🕒 June 16

General Dynamics Information Technology

10,000+ employees

🔒 Cybersecurity

🤖 Artificial Intelligence

DevSecOps Software Developer SME designing and maintaining automation and integration capabilities for cloud and software delivery environments. Enhance software delivery and reduce manual work for mission-focused solutions.

AWS

Azure

Cloud

Python