Senior DGX Cloud AI Infrastructure Software Engineer

🕒 February 3

NVIDIA

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulations and rendering tasks. Its applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Develop infrastructure software and tools for large-scale pre-training, post-training, and inference.
• Develop and optimize tools and libraries to improve infrastructure efficiency and resiliency.
• Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
• Enhance infrastructure and products underpinning NVIDIA's AI platforms.
• Define meaningful and actionable reliability metrics to track and improve system and service reliability.
• Apply problem-solving, root cause analysis, and optimization skills.
• Root-cause, analyze, and triage failures from the application level to the hardware level.

🎯 Requirements

• 8+ years of experience developing software infrastructure for large-scale AI systems.
• Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
• Strong debugging skills and experience analyzing and triaging AI applications from the application level to the hardware level.
• Experience with observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
• Proven track record of building and scaling large-scale distributed systems.
• Experience with AI training and inference infrastructure services.
• Proficiency in programming languages such as Python and C/C++, as well as scripting languages.
• Experience with quality software engineering practices, including test development, defensive programming, version control, and CI.
• Excellent communication and collaboration skills; a culture of diversity, intellectual curiosity, problem solving, and openness is essential.

🏖️ Benefits

• equity
• benefits
