Site Reliability Engineer

August 8

BentoML

Artificial Intelligence • B2B • SaaS

BentoML is a flexible platform designed to deploy and manage AI/ML models and custom inference pipelines in production. It offers a unified interface for seamless deployment, scaling, and optimization of various models, including large language models (LLMs). The platform empowers users to maintain full control over their AI models by allowing deployments in any environment, whether cloud or on-premise, while ensuring security and compliance without the data ever leaving the user's infrastructure.

51 - 200 employees

Founded 2019

📋 Description

About BentoML
• BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale.
• Join BentoML as a Senior Site Reliability Engineer and take charge of the infrastructure that delivers large language model and generative AI services worldwide.
• Architect and operate Kubernetes clusters across AWS, Google Cloud, and on-premises environments, turning vast GPU fleets into responsive inference pools.
• Your work will span writing clean Terraform code, refining GitOps pipelines, tuning Prometheus, and leading incident response.
• Set service level objectives that matter, guide teammates through complex production challenges, and build processes that keep our platform robust and fast.

🎯 Requirements

• Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
• Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.
• Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.
• Deep understanding of Linux and networking fundamentals.
• Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.
• Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.
• Solid background with Prometheus and Grafana at scale.
• Clear written and spoken communication and comfort working across time zones.

🏖️ Benefits

• Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.
• Technical scope – operate distributed LLM inference and large GPU clusters worldwide.
• Customer reach – support organizations around the globe that rely on BentoML.
• Influence – lead SRE practices and infrastructure choices.
• Compensation – competitive salary, equity, learning budget, and paid conference travel.
