Senior MLOps Engineer

October 2

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and technologies for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Its platforms, such as NVIDIA Omniverse, enable high-fidelity simulation and rendering. Applications range from autonomous vehicles built on NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

📋 Description

• Identify infrastructure and software bottlenecks to improve ML job startup time, data load/write time, resiliency, and failure recovery
• Translate research workflows into automated, scalable, and reproducible systems that accelerate experimentation
• Build CI/CD workflows tailored for ML to support data preparation, model training, validation, deployment, and monitoring
• Develop observability frameworks to monitor performance, utilization, and health of large-scale training clusters
• Collaborate with hardware and platform teams to optimize models for emerging GPU architectures, interconnects, and storage technologies
• Develop guidelines for dataset versioning, experiment tracking, and model governance to ensure reliability and compliance
• Mentor and guide engineering and research partners on MLOps patterns, scaling NVIDIA’s impact from research to production
• Collaborate with NVIDIA Research teams and the DGX Cloud Customer Success team to enhance MLOps automation continuously

🎯 Requirements

• BS in Computer Science, Information Systems, Computer Engineering, or equivalent experience
• 8+ years of experience in large-scale software or infrastructure systems, with 5+ years dedicated to ML platforms or MLOps
• Proven track record designing and operating ML infrastructure for production training workloads
• Expert knowledge of distributed training frameworks (PyTorch, TensorFlow, JAX) and orchestration systems (Kubernetes, Slurm, Kubeflow, Airflow, MLflow)
• Strong programming experience in Python plus at least one systems language (Go, C++, Rust)
• Deep understanding of GPU scheduling, container orchestration, and cloud-native environments
• Experience integrating observability stacks (Prometheus, Grafana, ELK) with ML workloads
• Familiarity with storage and data platforms that support large-scale training (object stores, feature stores, versioned datasets)
• Strong communication skills and the ability to collaborate effectively with research teams to turn requirements into scalable engineering solutions

🏖️ Benefits

• Equity
• Benefits


