Senior ML Platform Engineer

🕒 3 days ago

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
• Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
• Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices.
• Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation.
• Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.
• Participate in on-call rotation to provide support for platform services and infrastructure running critical ML jobs, driving root cause analysis and implementing preventative measures.
• Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes.
• Drive the adoption of modern GPU technologies (e.g., GB200, NVLink) and ensure smooth integration of next-generation hardware into ML pipelines.

🎯 Requirements

• BS/MS in Computer Science, Engineering, or equivalent experience.
• 5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
• Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with a proven track record of building and managing production infrastructure.
• SRE-oriented mindset with extensive experience in diagnosing system-level issues, performance tuning, and ensuring platform reliability.
• Solid understanding of ML workflows and lifecycle, from data preprocessing to deployment.
• Proficiency in operating containerized workloads with Kubernetes and Docker.
• Strong software engineering skills in languages such as Python or Go, with a focus on automation, tooling, and writing production-grade code.
• Experience with Linux systems internals, networking, and performance tuning at scale.

🏖️ Benefits

• Equity
• Benefits
