Staff Machine Learning Systems Engineer – MLOps

🔥 0 minutes ago

hims & hers

201 - 500 employees

Founded 2017

⚕️ Healthcare Insurance

🛍️ eCommerce

🧘 Wellness

Hims & Hers is an online platform with over 1 million subscribers that connects patients to licensed healthcare professionals across all 50 U.S. states. The service offers comprehensive support for sexual health, weight loss, hair regrowth, mental health, and skincare through a 100% online process. Clients can receive personalized treatment plans, which may include prescription medications, and benefit from free and discreet shipping. Hims & Hers prides itself on providing accessible and affordable healthcare without the need for insurance, offering transparent pricing and support from licensed providers. It aims to empower individuals by making healthcare and treatment, including mental health support for anxiety and depression, conveniently available on their own terms.

📋 Description

• Own and scale the AI compute and deployment platform
• Own and evolve our containerized application deployment platform and related systems for AI workloads, encompassing general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production.
• Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably.
• Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release.
• Drive efficiency and cost management across compute, autoscaling, and inference infrastructure.
• Operate and scale inference infrastructure and a multi-provider LLM AI gateway (e.g. Bedrock, Vertex, and other providers), including credentials, rate limits, and failover.
• Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level.
• Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company.
• Own the LLM/AI observability and tracing stack, provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse), so AI behavior is auditable and debuggable in production.
• Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders.
• Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability.
• Own and improve the monorepo build system and CI/CD pipelines for AI workloads, including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution.
• Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily.
• Identify and eliminate platform bottlenecks (reducing CI/CD cycle times, build latency, and deployment friction) to improve developer velocity across the Applied AI organization.
• Build IAM, OIDC, and secrets management as first-class infrastructure: scoped, least-privilege roles, write-only secret rotation, and cross-account access audits.
• Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first.
• Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access.
• Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution.
• Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering.
• Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems.

🎯 Requirements

• 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years focused on ML/AI systems in production.
• Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem: autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration.
• Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access.
• Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines.
• 2+ years of experience operating LLM-based systems in production (LLMOps): inference routing, serving, tracing, and the reliability patterns needed to run them at scale.
• Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines.
• Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams.
• A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping.
• Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives.
• Strong collaboration skills across engineering, ML, product, security, and clinical teams.
• A deep appreciation for safety, privacy, and security, ideally with experience in a regulated domain such as healthcare, fintech, or life sciences.

🏖️ Benefits

• Competitive salary & equity compensation for full-time roles
• Unlimited PTO, company holidays, and quarterly mental health days
• Comprehensive health benefits including medical, dental & vision, and parental leave
• Employee Stock Purchase Program (ESPP)
• 401k benefits with employer matching contribution
• Offsite team retreats
