Software Engineer – GPU Networking, Distributed Systems

🕒 February 24

🏢🏡 San Francisco – Hybrid

💵 $150k - $250k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

🧑‍💻 Full-stack Engineer

🦅 H1B Visa Sponsor

Baseten

11 - 50 employees

Founded 2020

🤖 Artificial Intelligence

☁️ SaaS

🏢 Enterprise

💰 $8M Seed Round on 2022-04

Baseten provides fast, scalable model inference services designed for performance, security, and a delightful developer experience. It offers tools that streamline the entire development process, enabling high-throughput inference and fast deployment times. Baseten serves enterprise companies with robust, secure, and scalable model serving, particularly for machine learning and AI model deployment. Its platform lets organizations manage model infrastructure efficiently while they focus on creating domain-specific models. Baseten also supports open-source model packaging and offers autoscaling to handle varying demand.

📋 Description

• Make RDMA First-Class: Integrate RDMA/RoCE/InfiniBand capabilities into our inference stack.
• Optimize Distributed Inference: Implement and tune networking layers for Disaggregated KV Cache Offload and WideEP.
• Enable Serverless-Grade Startup Speeds for LLMs: Work with checkpointing and storage for sub-10-second startup for models.
• Deep-Dive into Hardware: Validate networking performance on bleeding-edge clusters and write acceptance tests.
• Build Observability: Design tools to visualize packet flow and diagnose distributed system behaviors.
• Optimize Kernels: Work with communication libraries (NCCL, NVSHMEM) and write custom kernels to overlap compute and data transfer.

🎯 Requirements

• Deep experience with high-performance networking protocols (InfiniBand, RoCE v2) and an understanding of the physics of data movement.
• Fluency in C++ or Python, with the ability to bridge the gap between high-level logic and hardware.
• Deep understanding of the memory hierarchy in modern NVIDIA architectures (H100/Blackwell) and of how to optimize for it.
• Experience with NCCL, NVSHMEM, and UCX is highly preferred.
• Experience with GPUDirect Storage (GDS) or high-performance filesystems like Weka or 3FS.
• Familiarity with TensorRT-LLM, vLLM, or SGLang is a plus.
• Experience running low-level benchmarks to "qualify" new hardware clusters.

🏖️ Benefits

• Competitive compensation, including meaningful equity.
• 100% coverage of medical, dental, and vision insurance for employees and dependents.
• Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
• Paid parental leave.
• Company-facilitated 401(k).
• Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
