Senior ML Platform Engineer

🔥 0 minutes ago

Apply Now
Find Similar Remote Jobs

📊 Check your resume score for this job

Improve your chances of getting an interview by checking your resume score before you apply.

Logo of NVIDIA

NVIDIA

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphical processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulations and rendering tasks. Their applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, and AI-driven analytics and workflows.

📋 Description

• Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters. • Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads. • Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices. • Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation. • Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols. • Participate in on-call rotation to provide support for platform services and infrastructure running critical ML jobs, driving root cause analysis and implementing preventative measures. • Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes. • Drive the adoption of modern GPU technologies and ensure smooth integration of next-generation hardware into ML pipelines (e.g., GB200, NVLink, etc.).

🎯 Requirements

• BS/MS in Computer Science, Engineering, or equivalent experience. • 5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems. • Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with a proven track record of building and managing production infrastructure. • SRE-oriented mindset with extensive experience in diagnosing system-level issues, performance tuning, and ensuring platform reliability. • Solid understanding of ML workflows and lifecycle—from data preprocessing to deployment. • Proficiency in operating containerized workloads with Kubernetes and Docker. • Strong software engineering skills in languages such as Python or Go, with a focus on automation, tooling, and writing production-grade code. • Experience with Linux systems internals, networking, and performance tuning at scale.

🏖️ Benefits

• equity • benefits

Apply Now

Similar Jobs

🔥 3 hours ago

Lead Data Platform Engineer handling the technical architecture for the Enterprise Data Analytics Platform team. Driving large-scale engineering initiatives across the organization while mentoring engineers.

🔥 6 hours ago

Bridgeway Benefit Technologies

201 - 500

☁️ SaaS

👥 HR Tech

Senior Platform Engineer focused on architecting and maintaining Bridgeway's cloud infrastructure. Driving DevOps practices and delivering efficient platform solutions across teams.

🔥 6 hours ago

NeoBIM GmbH

1 - 10

🤖 Artificial Intelligence

🏠 Real Estate

Senior Platform Engineer at neoBIM transforming the construction industry with AI-powered BIM solutions. Focused on infrastructure, system reliability, and CI/CD workflows in a collaborative environment.

🔥 13 hours ago

Leidos

10,000+ employees

🔒 Cybersecurity

🔬 Science

Full Stack Power Platform Developer supporting a federal healthcare program at Leidos. Utilizing Power Platform suite for application development in a fast-paced environment.

🔥 23 hours ago

MANSCAPED

201 - 500

💄 Beauty

👥 B2C

🛍️ eCommerce

Senior Systems & Platform Engineer at MANSCAPED shaping Azure-based platform architecture and enterprise application integrations. Collaborating on cloud strategy and driving critical engineering initiatives.