Senior System Software Engineer, NCCL – Partner Enablement

October 2

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

📋 Description

• Engage with our partners and customers to root cause functional and performance issues reported with NCCL
• Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
• Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
• Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
• Document and conduct trainings/webinars for NCCL
• Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support

🎯 Requirements

• B.S./M.S. degree in CS/CE or equivalent experience, with 5+ years of relevant experience
• Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
• Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
• Experience working with an engineering or academic research community supporting HPC or AI
• Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
• Expert in Linux fundamentals and a scripting language, preferably Python
• Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
• Adaptability and passion to learn new areas and tools
• Flexibility to work and communicate effectively across different teams and time zones

🏖️ Benefits

• Equity
• Benefits
