Lead Machine Learning Operations Engineer

🕒 3 days ago

Paramount

10,000+ employees

Founded 1912

Media • B2C • Entertainment

Paramount is a global multimedia entertainment and news company that offers a range of services including direct-to-consumer digital subscription video on-demand and live streaming through Paramount+. It also owns Pluto TV, a leading free streaming television service, MTV, the world’s premier youth entertainment brand, and CBS Sports, a leader in television sports broadcasts. Paramount Pictures, since 1912, has been a legendary producer and distributor of films, hosting a library of over 1,000 titles. The company is deeply committed to inclusion and impact, focusing on diversity, global sustainability, and content that effects change. As a significant player in both live and on-demand streaming, Paramount offers a wide array of content, from sports to kids’ entertainment, comedy, and groundbreaking documentaries, across linear and streaming platforms globally.

📋 Description

• Own ML production reliability strategy
  • Define and lead the operational strategy for production ML systems, including monitoring, traceability, deployment safety, incident response, and post-deployment validation.
  • Set the standards ML teams use to assess model health, performance, and trustworthiness in production.
• Own model traceability and governance
  • Ensure every production model has clear lineage (data, features, code, artifacts, validation, deployment history) and drive adoption of model registry and metadata tooling across ML teams.
• Build end-to-end ML observability
  • Design and implement monitoring across the full ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.
• Define production health metrics
  • Partner with ML, data, product, and business stakeholders to define post-deployment metrics covering model quality, system reliability, business guardrails, and degradation indicators.
• Detect drift and degradation proactively
  • Detect data drift, feature drift, model behavior changes, and silent failures before they impact customers via thresholding, alerting, anomaly detection, and release-over-release monitoring.
• Lead diagnostic tooling and root-cause analysis
  • Build dashboards, logs, and diagnostic workflows that progress quickly from 'recommendations look off' to root cause, with context captured across candidates, features, scores, ranking decisions, and downstream outcomes.
• Own ML deployment safety
  • Define and operate automated gates that prevent bad models or bad data from being promoted to production.
  • Partner with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and release health reviews.
• Lead ML incident response
  • Own incident response practices for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortems.
  • Drive closure of systemic gaps after incidents rather than only resolving the immediate issue.
• Partner across ML Platform, Data, and ML
  • Partner with DevOps/Platform on infrastructure and observability needs; with Data Engineering on data quality, drift, and freshness; and with ML Engineering to embed operational requirements into development and deployment workflows.
• Set standards and mentor others
  • Act as the technical lead for ML operations: establish reusable patterns, playbooks, and standards, and mentor engineers on reliability, observability, and operational rigor.

🎯 Requirements

• 5+ years of experience in machine learning engineering, ML platform, applied ML, MLOps, data platform, reliability engineering, or a related technical role.
• Demonstrated experience operating production ML systems, including monitoring, deployment, incident response, model validation, data quality, or reliability ownership.
• Experience leading technical initiatives across multiple engineering teams, especially where success required influencing architecture, tooling, standards, or adoption.
• Hands-on experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.
• Solid knowledge of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and business outcome measurement.
• Ability to reason about ML operational failure modes: stale features, distribution shift, training-serving skew, delayed labels, and offline-online metric gaps.
• Solid SQL skills and comfort investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.
• Track record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver production-grade operational capabilities.
• Solid written and verbal communication skills, including the ability to explain ML system health, risks, incidents, and tradeoffs to both technical and non-technical stakeholders.

🏖️ Benefits

• medical • dental • vision • 401(k) plan • life insurance coverage • disability benefits • tuition assistance program • PTO
