
10,000+ employees
Founded 1993
🤖 Artificial Intelligence
🎮 Gaming
Artificial Intelligence • Gaming • Automotive
NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphical processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulations and rendering tasks. Their applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, and AI-driven analytics and workflows.
🔥 0 minutes ago
🏄 California, Oregon, +2 more states – Remote
💵 $184k - $356.5k / year
⏰ Full Time
🟠 Senior
🗣️ LLM Engineer
🦅 H1B Visa Sponsor
• Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
• Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
• Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
• Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
• Own root-cause analysis of complex failures (hangs, performance regressions, and topology sensitivity) in large distributed environments.
• Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
• Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
• Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
• Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
• Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.
• Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
• 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
• Expertise debugging and triaging AI applications across the full stack, from the application layer down to the hardware.
• Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
• Proven track record of architecting, debugging, and scaling large-scale distributed systems.
• Expert-level Python and C/C++ programming skills.
• Experience operating workloads in scheduled, containerized cluster environments.
• Excellent analytical, debugging, and communication skills, with the ability to influence across teams.
• Equity
• Benefits