Senior Software Engineer – NVLink Rack Scale Stability and Reliability

🕒 May 22

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and platforms for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Products such as NVIDIA Omniverse enable high-fidelity simulation and rendering. Its applications span multiple industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
• Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
• Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
• Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
• Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.
• Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.
• Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.

🎯 Requirements

• BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience.
• 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
• Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.
• Strong system-level debugging across software, firmware, hardware, and networking layers.
• Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
• Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
• Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
• Strong communication and collaboration skills across engineering, customer, and operations teams.
• Passion for building reliable next-generation AI infrastructure and solving complex system-level challenges at scale.

🏖️ Benefits

• Eligible for equity and benefits
