Senior Site Reliability Engineer - AI Studio

June 6

Nebius Group

AI • Enterprise • SaaS

Nebius Group is building one of the world’s leading AI infrastructure companies, focusing on providing the necessary compute, storage, and tools for developers in the AI space. Based in Europe and listed on Nasdaq, Nebius has a global presence with R&D centers across Europe, North America, and Israel. The company's primary offering is an AI-centric cloud platform designed for intensive AI workloads, complemented by various other businesses involved in generative AI development, edtech, and autonomous technology.

1001 - 5000 employees

🏢 Enterprise

☁️ SaaS

📋 Description

• Own the reliability, performance, and observability of the entire inference stack.
• Design and refine telemetry pipelines: metrics, logs, and traces.
• Tune Kubernetes autoscalers and craft Terraform modules.
• Harden request-routing and retry logic for resilience.
• Use automation and runbooks to detect, isolate, and remediate issues.
• Drive post-mortem culture to prevent recurrence.
• Scale the platform while meeting cost and reliability targets.

🎯 Requirements

• Deep fluency with Kubernetes, Prometheus, Grafana, Terraform, and the craft of infrastructure-as-code.
• Ability to script comfortably in Python or Bash.
• Understanding the nuances of alert design and SLOs for high-throughput APIs.
• Experience with GPU-heavy workloads, whether with vLLM, Triton, Ray, or another accelerator stack.
• Background in MLOps or model-hosting platforms is a plus.
• Ability to build self-healing systems and collaborate with software engineers.

🏖️ Benefits

• Competitive salary and comprehensive benefits package.
• Opportunities for professional growth within Nebius.
• Hybrid working arrangements.
• A dynamic and collaborative work environment that values initiative and innovation.

