Principal Software Engineer – Rack Scale Systems Infrastructure

🕒 May 16

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and platforms for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Products such as NVIDIA Omniverse enable high-fidelity simulation and rendering. NVIDIA's applications span autonomous vehicles (NVIDIA DRIVE), healthcare (NVIDIA Clara), and AI-driven analytics and workflows.

📋 Description

• Define the complete software architecture for rack-scale infrastructure products and services, covering control plane services, infrastructure management, firmware, operating systems, kernel drivers, networking fabrics, accelerator software, and user-mode manageability software.
• Use Kubernetes and cloud-native primitives as an infrastructure fabric when appropriate, including controllers, operators, reconciliation loops, and open source components that can operate safely at rack and fleet scale.
• Build open source infrastructure software that can be adopted in different forms, including libraries, services, controllers, operators, and integration APIs for internal deployments and CSP environments.
• Bridge hardware and software teams across firmware, BMC, BIOS, boot flows, OS images, drivers, networking, NVLink domains, InfiniBand, GPUs, DPUs, CPUs, and system management interfaces.
• Translate forward-looking infrastructure roadmaps into formal software requirements, architecture specifications, and execution plans that align teams across the organization.
• Partner directly with hyperscalers, CSPs, enterprise customers, internal component leads, vendors, and business partners to align infrastructure capabilities with real-world deployment and integration needs.
• Establish reliability, security, validation, and shift-left strategies that reduce risk before hardware reaches production environments.
• Mentor senior engineers and technical leads, raising the engineering bar for large-scale networked systems, foundational software, and rack-scale control plane development.
• Make high-quality technical decisions in ambiguous environments, balancing customer needs, schedule, hardware realities, software maintainability, open source adoption, and long-term infrastructure evolution.

🎯 Requirements

• BS or MS in Computer Engineering, Computer Science, Electrical Engineering, or a related field, or equivalent experience.
• 15+ years of proven experience in systems architecture, system software, distributed systems, infrastructure control planes, or infrastructure engineering.
• Solid architectural knowledge of coordination frameworks, state machines, declarative APIs, reconciliation loops, lifecycle orchestration, failure handling, upgrade and rollback workflows, and distributed systems tradeoffs.
• Practical coding skills in Go, C++, or Rust, including the ability to write, review, and direct production-quality infrastructure software.
• Experience with Rust is highly valued.
• Experience with Kubernetes or similar orchestration systems, especially as a fabric for managing infrastructure, hardware resources, or large-scale infrastructure services.
• Experience with Linux-based infrastructure software, OS rollout and image management, kernel or driver interactions, firmware lifecycle, and hardware bring-up workflows.
• Strong understanding of data center networking technologies and protocols, such as Ethernet, InfiniBand, RDMA, and fabric-level manageability.
• Experience with complex accelerator-based systems, including GPUs, DPUs, FPGAs, custom silicon, or other high-performance computing systems.
• Expertise in in-band and out-of-band management architectures, including BMCs, Redfish, IPMI, and related system management protocols.
• Ability to work with security experts to define practical tradeoffs across secure boot, attestation, access control, update safety, serviceability, and ease of operation.
• Experience crafting software intended for open source release, including API stability, modularity, documentation, community usability, and clean separation between shared software and deployment-specific integrations.
• Experience using AI-assisted development tools responsibly as an engineering multiplier for coding, test generation, debugging, build iteration, and documentation.
• Demonstrated skill in specifying requirements, guiding architecture, and managing delivery across multiple engineering teams and organizations.
• Strong written and verbal communication skills, with the ability to explain complex hardware/software tradeoffs clearly to engineering leaders, customers, partners, and executives.

🏖️ Benefits

• Equity
• Benefits

