Senior Software Engineer, DGX Cloud AI Infrastructure

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its products span industries from autonomous vehicles (NVIDIA DRIVE) to healthcare (NVIDIA Clara), as well as AI-driven analytics and workflows.

🕒 June 4

🏄 California, Oregon, +2 more states – Remote

💵 $184k - $356.5k / year

⏰ Full Time

🟠 Senior

🗣️ LLM Engineer

🦅 H1B Visa Sponsor

Distributed Systems

Node.js

Python

PyTorch

📋 Description

• Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
• Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
• Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
• Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
• Own root-cause analysis of complex failures, such as hangs, performance regressions, and topology sensitivity, in large distributed environments.
• Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
• Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
• Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
• Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
• Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

🎯 Requirements

• Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
• 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
• Expertise in debugging and triaging AI applications across the full stack, from the application layer down to the hardware.
• Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
• Proven track record of architecting, debugging, and scaling large-scale distributed systems.
• Expert-level Python and C/C++ programming skills.
• Experience operating workloads in scheduled, containerized cluster environments.
• Excellent analytical, debugging, and communication skills, with the ability to influence across teams.

🏖️ Benefits

• Equity
• Benefits
