Lead Machine Learning Operations Engineer

10,000+ employees

Founded 1912

💼 Consulting

📣 Marketing

📱 Media

Consulting • Marketing • Media

Paramount is a global multimedia entertainment and news company. Its services include direct-to-consumer digital subscription video on demand and live streaming through Paramount+. It also owns Pluto TV, a leading free streaming television service; MTV, the world’s premier youth entertainment brand; and CBS Sports, a leader in television sports broadcasting. Paramount Pictures, a legendary film producer and distributor since 1912, holds a library of more than 1,000 titles. The company is deeply committed to inclusion and impact, with a focus on diversity, global sustainability, and content that effects change. As a significant player in both live and on-demand streaming, Paramount offers a wide range of content, from sports and kids’ entertainment to comedy and groundbreaking documentaries, across linear and streaming platforms worldwide.

Lead Machine Learning Operations Engineer

🕒 June 16

🏄 California, New York – Remote

💵 $157k - $235k / year

⏰ Full Time

🟠 Senior

🤖 Machine Learning Engineer

🦅 H1B Visa Sponsor

SQL

📋 Description

• Own ML production reliability strategy. Define and lead the operational strategy for production ML systems, including monitoring, traceability, deployment safety, incident response, and post-deployment validation. Set the standards ML teams use to assess model health, performance, and trustworthiness in production.
• Own model traceability and governance. Ensure every production model has clear lineage (data, features, code, artifacts, validation, and deployment history), and drive adoption of model registry and metadata tooling across ML teams.
• Build end-to-end ML observability. Design and implement monitoring across the full ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.
• Define production health metrics. Partner with ML, data, product, and business stakeholders to define post-deployment metrics covering model quality, system reliability, business guardrails, and degradation indicators.
• Detect drift and degradation proactively. Use thresholding, alerting, anomaly detection, and release-over-release monitoring to catch data drift, feature drift, model behavior changes, and silent failures before they affect customers.
• Lead diagnostic tooling and root-cause analysis. Build dashboards, logs, and diagnostic workflows that take engineers quickly from “recommendations look off” to a root cause, capturing context across candidates, features, scores, ranking decisions, and downstream outcomes.
• Own ML deployment safety. Define and operate automated gates that prevent bad models or bad data from being promoted to production. Partner with ML engineers to establish validation checks, rollback criteria, canary strategies, shadow testing, and release health reviews.
• Lead ML incident response. Own incident response practices for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortems. Drive closure of systemic gaps after incidents rather than only resolving the immediate issue.
• Partner across ML Platform, Data, and ML Engineering. Partner with DevOps/Platform on infrastructure and observability needs; with Data Engineering on data quality, drift, and freshness; and with ML Engineering to embed operational requirements into development and deployment workflows.
• Set standards and mentor others. Act as the technical lead for ML operations: establish reusable patterns, playbooks, and standards, and mentor engineers on reliability, observability, and operational rigor.

🎯 Requirements

• 5+ years of experience in machine learning engineering, ML platform, applied ML, MLOps, data platform, reliability engineering, or a related technical role.
• Demonstrated experience operating production ML systems, including monitoring, deployment, incident response, model validation, data quality, or reliability ownership.
• Experience leading technical initiatives across multiple engineering teams, especially where success required influencing architecture, tooling, standards, or adoption.
• Hands-on experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.
• Solid knowledge of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and business outcome measurement.
• Ability to reason about ML operational failure modes: stale features, distribution shift, training-serving skew, delayed labels, and offline-online metric gaps.
• Solid SQL skills and comfort investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.
• Track record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver production-grade operational capabilities.
• Solid written and verbal communication skills, including the ability to explain ML system health, risks, incidents, and tradeoffs to both technical and non-technical stakeholders.

🏖️ Benefits

• Medical
• Dental
• Vision
• 401(k) plan
• Life insurance coverage
• Disability benefits
• Tuition assistance program
• Paid time off (PTO)

Similar Jobs

Senior Software Engineer, Machine Learning Inference Platform

🕒 June 15

Stack AV

51 - 200

📦 Logistics

🏭 Manufacturing

💼 Consulting

Senior Engineer responsible for technical design and delivery within an AI inference platform. Collaborating with teams on system performance and model onboarding for the autonomous transportation sector.

🇺🇸 United States – Remote

⏰ Full Time

🟠 Senior

🤖 Machine Learning Engineer

Distributed Systems

gRPC

Python

PyTorch

Rust

ML Engineer

🕒 June 15

Docker, Inc

51 - 200

💼 Consulting

☁️ SaaS

ML Engineer developing intelligence-driven product capabilities for Docker's platform. Collaborating with founding engineers to shape technical direction and build ML systems that enhance security and governance.

🇺🇸 United States – Remote

💵 $138.5k - $225.5k / year

💰 $105M Series C on 2022-03

⏰ Full Time

🟡 Mid-level

🟠 Senior

🤖 Machine Learning Engineer

Senior Staff Machine Learning Engineer

🕒 June 13

Workiva

1001 - 5000

💼 Consulting

🏥 Healthcare

📦 Logistics

Senior Staff Machine Learning Engineer defining how AI is architected and deployed across Workiva’s platform. Leading design and implementation of enterprise AI systems for mission-critical workflows.

🇺🇸 United States – Remote

💵 $193k - $308k / year

⏰ Full Time

🟠 Senior

🤖 Machine Learning Engineer

🦅 H1B Visa Sponsor

Cloud

Distributed Systems

Java

Python

Scala

Machine Learning Engineer

🕒 June 12

Local Infusion

1 - 10

🏥 Healthcare

Machine Learning Engineer at Local Infusion building AI-driven technology to enhance specialty infusion care. Develop models for operational efficiency and patient treatment acceleration.

🇺🇸 United States – Remote

💰 $4M Seed Round on 2022-11

⏰ Full Time

🟡 Mid-level

🟠 Senior

🤖 Machine Learning Engineer

AWS

Cloud

Pandas

Python

PyTorch

scikit-learn

SQL

TensorFlow

Machine Learning Engineer – Platform

🕒 June 12

Artera.net

11 - 50

💼 Consulting

📦 Logistics

📣 Marketing