Staff Machine Learning Systems Engineer – MLOps

🕒 Yesterday

🇺🇸 United States – Remote (USA)

💵 $210,000 - $250,000 / year

⏰ Full-Time

🔴 Expert

⚙️ Systems Engineer

🗣️🇺🇸🇬🇧 English required

hims & hers

201 - 500 employees

Founded in 2017

⚕️ Healthcare Insurance

🛍️ eCommerce

🧘 Wellness

Hims & Hers is a 100% online health platform that provides personalized care and consultations for sexual health, hair growth, weight loss, skin care, and mental health. With more than 2 million subscribers, Hims & Hers connects patients with licensed healthcare professionals, offering prescription medications and ongoing support delivered directly to their homes. The platform aims to make medical care accessible, affordable, and discreet, so that users can conveniently manage their treatment through an app.

Description

• Own and scale the AI compute and deployment platform
• Own and evolve our containerized application deployment platform and related systems for AI workloads, encompassing general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production.
• Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably.
• Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release.
• Drive efficiency and cost management across compute, autoscaling, and inference infrastructure.
• Operate and scale inference infrastructure and a multi-provider LLM AI gateway (e.g. Bedrock, Vertex, and other providers), including credentials, rate limits, and failover.
• Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level.
• Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company.
• Own the LLM/AI observability and tracing stack, provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse), so AI behavior is auditable and debuggable in production.
• Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders.
• Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability.
• Own and improve the monorepo build system and CI/CD pipelines for AI workloads, including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution.
• Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily.
• Identify and eliminate platform bottlenecks (reducing CI/CD cycle times, build latency, and deployment friction) to improve developer velocity across the Applied AI organization.
• Build IAM, OIDC, and secrets management as first-class infrastructure: scoped, least-privilege roles, write-only secret rotation, and cross-account access audits.
• Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first.
• Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access.
• Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution.
• Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering.
• Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems.

🎯 Requirements

• 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years focused on ML/AI systems in production.
• Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem: autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration.
• Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access.
• Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines.
• 2+ years of experience operating LLM-based systems in production (LLMOps): inference routing, serving, tracing, and the reliability patterns needed to run them at scale.
• Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines.
• Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams.
• A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping.
• Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives.
• Strong collaboration skills across engineering, ML, product, security, and clinical teams.
• A deep appreciation for safety, privacy, and security, ideally with experience in a regulated domain such as healthcare, fintech, or life sciences.

🏖️ Benefits

• Competitive salary & equity compensation for full-time roles
• Unlimited PTO, company holidays, and quarterly mental health days
• Comprehensive health benefits including medical, dental & vision, and parental leave
• Employee Stock Purchase Program (ESPP)
• 401k benefits with employer matching contribution
• Offsite team retreats
