Senior Software Engineer, DGX Cloud AI Infrastructure

NVIDIA

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara and AI-driven analytics and workflows.

📋 Description

• Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
• Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
• Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
• Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
• Own root-cause analysis of complex failures: hangs, performance regressions, and topology sensitivity in large distributed environments.
• Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
• Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
• Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
• Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
• Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

🎯 Requirements

• Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
• 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
• Expertise debugging and triaging AI applications across the full stack, from the application layer down to the hardware.
• Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
• Proven track record of architecting, debugging, and scaling large-scale distributed systems.
• Expert-level Python and C/C++ programming skills.
• Experience operating workloads in scheduled, containerized cluster environments.
• Excellent analytical, debugging, and communication skills, with the ability to influence across teams.

🏖️ Benefits

• Equity
• Benefits
