Senior Site Reliability Engineer, Managed Kubernetes

October 9

Lambda

Artificial Intelligence • SaaS • Hardware

Lambda provides cloud-based solutions and hardware for AI development. It offers on-demand GPU clusters for multi-node training and fine-tuning, as well as inference endpoints and APIs. Its products include the Lambda GPU Cloud, which features NVIDIA's latest generation of infrastructure for enterprise AI, and customizable GPU workstations and desktops designed for AI and deep learning. Lambda also offers a one-line installation and managed upgrade path for machine learning tools like PyTorch, TensorFlow, and NVIDIA CUDA. With its focus on AI developers, Lambda provides both public and private cloud services with access to NVIDIA Tensor Core GPUs.

51 - 200 employees

💰 $39.7M venture round in November 2022

📋 Description

• Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
• Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
• Participate in a well-managed on-call rotation for critical incidents
• Assist customers with Kubernetes questions, workload integration, storage, and authentication
• Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
• Use Python and Golang to create tooling and automate the validation of platform quality
• Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
• Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
• Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability

🎯 Requirements

• 6+ years of experience in an SRE, operations engineer, or similar role, with deep knowledge of running Linux clusters and systems
• Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
• Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
• Can work either independently with limited direction or as part of a team
• Can work with customers during incidents via tickets, live messaging, or as part of a larger call
• Familiarity with observability tools like Prometheus, Grafana, and FluentBit, and with CI/CD pipelines
• Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar

🏖️ Benefits

• Health, dental, and vision coverage for you and your dependents
• Wellness and Commuter stipends for select roles
• 401k Plan with 2% company match (USA employees)
• Flexible Paid Time Off Plan that we all actually use
