Senior Software Engineer, RL Post-Training Frameworks

🔥 34 minutes ago

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulations and rendering tasks. Its products span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Architecting and building RL post-training infrastructure that scales efficiently from experimentation on a single GPU to production across thousands of nodes.
• Tuning RL training-inference-rollout loops on GPUs, CPUs, and LPUs for performance where it matters.
• Contributing to and improving the performance and usability of open-source RL frameworks.
• Partnering with teams building CPU-driven rollout workloads, including tool-use, code execution, and agentic environments.
• Advocating for researcher and partner needs with NVIDIA's networking, math library, and compiler teams.

🎯 Requirements

• MS or PhD in Computer Science, Computer Engineering, or a related field (or equivalent experience)
• 5+ years of professional experience in distributed systems, high-performance computing, deep learning infrastructure, or ML systems engineering
• Strong proficiency in Python and C/C++
• Demonstrated experience building or contributing to large-scale distributed systems or runtime frameworks in production at a frontier AI lab, hyperscaler, or major technology company
• Strong verbal and written communication skills and the ability to collaborate across organizational and geographic boundaries
• Depth in one or more of the following technical areas:
  • Reinforcement learning for LLM post-training (RLHF, PPO, GRPO, DPO, reward modeling), including how algorithms map to distributed execution and the systems challenges they create (heterogeneous placement, rollouts, environment execution, resharding between training and generation)
  • PyTorch internals, including distributed training primitives (FSDP, tensor parallelism, pipeline parallelism) and their composition
  • Kubernetes runtime internals (container lifecycle, pod scheduling, resource quotas, GPU allocation)
  • End-to-end distributed systems design (service boundaries, data flows, consistency models, failure modes, recovery approaches)

🏖️ Benefits

• Highly competitive salaries
• Comprehensive benefits package
