Staff Site Reliability Engineer – Volcano


Kong Inc.

201 - 500 employees

Founded 2017

🔌 API

☁️ SaaS

🏢 Enterprise

💰 $100M Series D on 2021-02

Kong Inc. is a company that provides a comprehensive API platform designed to facilitate API management, AI integration, and developer productivity. It offers solutions like Kong Gateway, Kong Konnect, and a variety of other tools targeted at managing and optimizing the API lifecycle. Kong's platform supports multi-cloud environments and is built to deliver high performance and security. It is notably recognized by Gartner as a leader in API management and supports innovations across industries like financial services, healthcare, and technology. The company emphasizes flexibility, security, and speed, making it a favored choice for enterprises looking to enhance their digital services through APIs. Kong also supports a robust community of developers and provides extensive integrations and plugins to streamline API management and operations.

📋 Description

• Own reliability for Volcano end-to-end: Define and drive SLOs, error budgets, and incident response practices for all Volcano services (edge deployments, managed Postgres, auth, realtime, storage, and the control plane).
• Architect the platform's infrastructure: Design and build the multi-region Kubernetes infrastructure, networking, and data plane that powers Volcano's edge deployment pipeline and backend-as-a-service capabilities.
• Build the GitOps and CI/CD backbone: Establish deployment automation, canary pipelines, and preview environment provisioning using ArgoCD, Helm, and Terraform/Terragrunt, setting patterns the broader team will follow.
• Scale managed data services: Design, operate, and harden multi-tenant PostgreSQL clusters, Redis caching layers, and object storage, with a focus on data isolation, performance, and disaster recovery.
• Drive observability from day one: Instrument every Volcano service with meaningful SLIs; build dashboards, alerts, and runbooks using Datadog, Prometheus, and Grafana before services go live, not after incidents.
• Lead cross-functional reliability work: Collaborate with the OCTO team, product engineering, and security to bake reliability and compliance into Volcano's architecture, not bolt it on later.
• Set SRE culture and standards: Mentor engineers across Volcano's contributing teams on reliability principles; lead postmortems, define on-call practices, and build a blameless engineering culture.
• Evaluate and adopt emerging technologies: Given Volcano's greenfield nature, evaluate and make architectural decisions on edge runtimes, serverless compute, vector databases, and AI-native infrastructure components.

🎯 Requirements

• BS in Computer Science or equivalent; substantial experience at Staff or Principal IC level in SRE/Platform Engineering.
• Proven track record building SRE or platform engineering practices for developer-facing platforms or PaaS/SaaS products, ideally at greenfield stage.
• Deep Kubernetes expertise: multi-tenant cluster design, networking (CNI, service mesh, ingress), autoscaling, and security hardening.

🏖️ Benefits

• Healthcare benefits
• 401(k) plan
• Short- and long-term disability benefits
• Basic life and AD&D insurance

