Lead Machine Learning Operations Engineer

🔥 0 minutes ago

🗽 New York – Remote

💵 $157k - $235k / year

⏰ Full Time

🟠 Senior

🤖 Machine Learning Engineer

🦅 H1B Visa Sponsor

Paramount

10,000+ employees

Founded 1912

📱 Media

👥 B2C

Media • B2C • Entertainment

Paramount is a global multimedia entertainment and news company. Its services include direct-to-consumer subscription video on demand and live streaming through Paramount+. It also owns Pluto TV, a leading free streaming television service; MTV, the world’s premier youth entertainment brand; and CBS Sports, a leader in television sports broadcasts. Since 1912, Paramount Pictures has been a legendary producer and distributor of films, with a library of over 1,000 titles. The company is committed to inclusion and impact, focusing on diversity, global sustainability, and content that effects change. As a significant player in both live and on-demand streaming, Paramount offers a wide array of content, from sports to kids’ entertainment, comedy, and groundbreaking documentaries, across linear and streaming platforms globally.

📋 Description

• Own ML production reliability strategy
  • Define and lead the operational strategy for production ML systems, including monitoring, traceability, deployment safety, incident response, and post-deployment validation.
  • Set the standards ML teams use to assess model health, performance, and trustworthiness in production.
• Own model traceability and governance
  • Ensure every production model has clear lineage (data, features, code, artifacts, validation, deployment history) and drive adoption of model registry and metadata tooling across ML teams.
• Build end-to-end ML observability
  • Design and implement monitoring across the full ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.
• Define production health metrics
  • Partner with ML, data, product, and business stakeholders to define post-deployment metrics covering model quality, system reliability, business guardrails, and degradation indicators.
• Detect drift and degradation proactively
  • Detect data drift, feature drift, model behavior changes, and silent failures before they impact customers via thresholding, alerting, anomaly detection, and release-over-release monitoring.
• Lead diagnostic tooling and root-cause analysis
  • Build dashboards, logs, and diagnostic workflows that progress quickly from 'recommendations look off' to root cause, with context captured across candidates, features, scores, ranking decisions, and downstream outcomes.
• Own ML deployment safety
  • Define and operate automated gates that prevent bad models or bad data from being promoted to production.
  • Partner with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and release health reviews.
• Lead ML incident response
  • Own incident response practices for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortems.
  • Drive closure of systemic gaps after incidents rather than only resolving the immediate issue.
• Partner across ML Platform, Data, and ML
  • Partner with DevOps/Platform on infrastructure and observability needs; with Data Engineering on data quality, drift, and freshness; and with ML Engineering to embed operational requirements into development and deployment workflows.
• Set standards and mentor others
  • Act as the technical lead for ML operations: establish reusable patterns, playbooks, and standards, and mentor engineers on reliability, observability, and operational rigor.

🎯 Requirements

• 5+ years of experience in machine learning engineering, ML platform, applied ML, MLOps, data platform, reliability engineering, or a related technical role.
• Demonstrated experience operating production ML systems, including monitoring, deployment, incident response, model validation, data quality, or reliability ownership.
• Experience leading technical initiatives across multiple engineering teams, especially where success required influencing architecture, tooling, standards, or adoption.
• Hands-on experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.
• Solid knowledge of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and business outcome measurement.
• Ability to reason about ML operational failure modes: stale features, distribution shift, training-serving skew, delayed labels, and offline-online metric gaps.
• Solid SQL skills and comfort investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.
• Track record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver production-grade operational capabilities.
• Solid written and verbal communication skills, including the ability to explain ML system health, risks, incidents, and tradeoffs to both technical and non-technical stakeholders.

🏖️ Benefits

• Medical
• Dental
• Vision
• 401(k) plan
• Life insurance coverage
• Disability benefits
• Tuition assistance program
• PTO

