Lead Machine Learning Operations Engineer

🔥 21 minutes ago

🏄 California, New York – Remote


💵 $157k - $235k / year

⏰ Full Time

🟠 Senior

🤖 Machine Learning Engineer

🦅 H1B Visa Sponsor



Paramount

10,000+ employees

Founded 1912

📱 Media

👥 B2C

Media • B2C • Entertainment

Paramount is a global multimedia entertainment and news company that offers a range of services, including direct-to-consumer digital subscription video on-demand and live streaming through Paramount+. It also owns Pluto TV, a leading free streaming television service; MTV, the world’s premier youth entertainment brand; and CBS Sports, a leader in television sports broadcasts. Paramount Pictures has been a legendary producer and distributor of films since 1912, with a library of over 1,000 titles. The company is deeply committed to inclusion and impact, focusing on diversity, global sustainability, and content that effects change. As a significant player in both live and on-demand streaming, Paramount offers a wide array of content across linear and streaming platforms globally, from sports to kids’ entertainment, comedy, and groundbreaking documentaries.

📋 Description

• Own ML production reliability strategy
  • Define and lead the operational strategy for production ML systems, including monitoring, traceability, deployment safety, incident response, and post-deployment validation.
  • Set the standards ML teams use to assess model health, performance, and trustworthiness in production.
• Own model traceability and governance
  • Ensure every production model has clear lineage (data, features, code, artifacts, validation, deployment history) and drive adoption of model registry and metadata tooling across ML teams.
• Build end-to-end ML observability
  • Design and implement monitoring across the full ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.
• Define production health metrics
  • Partner with ML, data, product, and business stakeholders to define post-deployment metrics covering model quality, system reliability, business guardrails, and degradation indicators.
• Detect drift and degradation proactively
  • Detect data drift, feature drift, model behavior changes, and silent failures before they impact customers via thresholding, alerting, anomaly detection, and release-over-release monitoring.
• Lead diagnostic tooling and root-cause analysis
  • Build dashboards, logs, and diagnostic workflows that progress quickly from “recommendations look off” to root cause, with context captured across candidates, features, scores, ranking decisions, and downstream outcomes.
• Own ML deployment safety
  • Define and operate automated gates that prevent bad models or bad data from being promoted to production.
  • Partner with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and release health reviews.
• Lead ML incident response
  • Own incident response practices for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortems.
  • Drive closure of systemic gaps after incidents rather than only resolving the immediate issue.
• Partner across ML Platform, Data, and ML
  • Partner with DevOps/Platform on infrastructure and observability needs; with Data Engineering on data quality, drift, and freshness; and with ML Engineering to embed operational requirements into development and deployment workflows.
• Set standards and mentor others
  • Act as the technical lead for ML operations: establish reusable patterns, playbooks, and standards, and mentor engineers on reliability, observability, and operational rigor.

🎯 Requirements

• 5+ years of experience in machine learning engineering, ML platform, applied ML, MLOps, data platform, reliability engineering, or a related technical role.
• Demonstrated experience operating production ML systems, including monitoring, deployment, incident response, model validation, data quality, or reliability ownership.
• Experience leading technical initiatives across multiple engineering teams, especially where success required influencing architecture, tooling, standards, or adoption.
• Hands-on experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.
• Solid knowledge of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and business outcome measurement.
• Ability to reason about ML operational failure modes: stale features, distribution shift, training-serving skew, delayed labels, and offline-online metric gaps.
• Solid SQL skills and comfort investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.
• Track record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver production-grade operational capabilities.
• Solid written and verbal communication skills, including the ability to explain ML system health, risks, incidents, and tradeoffs to both technical and non-technical stakeholders.

🏖️ Benefits

• medical
• dental
• vision
• 401(k) plan
• life insurance coverage
• disability benefits
• tuition assistance program
• PTO


Similar Jobs

🔥 33 minutes ago

Airbus

10,000+ employees

🚀 Aerospace

HR AI/ML Engineer developing and deploying AI solutions for HR data management at Airbus. Collaborating with teams to improve HR processes through innovative technology.

🔥 3 hours ago

Path Robotics

201 - 500

🤖 Artificial Intelligence

🔧 Hardware

Sr. ML Engineer designing and deploying reinforcement learning algorithms at Path Robotics. Collaborating with cross-functional teams on robotic control and adaptive behaviors in dynamic environments.

🔥 6 hours ago

Stack AV

51 - 200

🚗 Transport

🤖 Artificial Intelligence

Senior Engineer responsible for technical design and delivery within an AI inference platform. Collaborating with teams on system performance and model onboarding for the autonomous transportation sector.

🔥 10 hours ago

Docker, Inc

51 - 200

ML Engineer developing intelligence-driven product capabilities for Docker's platform. Collaborating with founding engineers to shape technical direction and build ML systems that enhance security and governance.

🇺🇸 United States – Remote

💵 $138.5k - $225.5k / year

💰 $105M Series C on 2022-03

⏰ Full Time

🟡 Mid-level

🟠 Senior

🤖 Machine Learning Engineer

🕒 Yesterday

DevIQ

11 - 50

☁️ SaaS

🤖 Artificial Intelligence

Senior AI/Machine Learning Engineer at DevIQ designing and deploying AI solutions. Collaborating with clients to address real business problems and ensuring effective model delivery.