Software Engineer, Workload Enablement

201 - 500 employees

Founded 2015

🤖 Artificial Intelligence

☁️ SaaS

🏢 Enterprise

OpenAI is a research organization and company dedicated to building advanced artificial intelligence, with a strong emphasis on safety and ethics. Its mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. The company develops AI products such as ChatGPT, which assists users with tasks ranging from everyday requests to complex enterprise work, and provides an API platform for integrating its models into other applications. OpenAI focuses on innovation in AI and on improving data analysis capabilities, while emphasizing the safety and ethical governance of its systems.

🕒 March 28

🏢🏡 San Francisco – Hybrid

💵 $293k - $455k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

🧑‍💻 Full-stack Engineer

🦅 H1B Visa Sponsor

Distributed Systems

Kubernetes

Python

PyTorch

📋 Description

• Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
• Build a suite of benchmarks and stress tests that capture real end-to-end behavior of our workloads by exercising all aspects of a system, including CPU, GPU, memory subsystem, frontend, scale-up, and scale-out networking (including WAN traffic, NVLink, and RDMA collectives), storage, thermals, and any other relevant parts.
• Deep-dive performance on distributed training/inference:
    • Collective performance and tuning (across NCCL/RCCL and internal libraries)
    • Overlap of compute/communication, kernel-level bottlenecks, memory bandwidth, and scheduling effects
• Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs (pass/fail, performance score, regression detection).
• Partner with systems + fleet bring-up engineers to ensure the platform is not only stable and performant, but also operationally usable and scalable (containerization, K8s integration, telemetry hooks, failure triage loops).
• Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal repros, and prioritized issue lists.

🎯 Requirements

• BS in CS/EE (or equivalent practical experience).
• 5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
• Strong hands-on experience with:
    • PyTorch and modern LLM training/inference stacks
    • Large-scale distributed training concepts (data/model/pipeline parallel, collective comms)
• Experience with RDMA and debugging/optimizing comms libraries (NCCL or RCCL) and their interaction with hardware/network
• Proficiency in Python plus comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
• Strong profiling/debugging skills (e.g., Nsight, rocprof, perf, flamegraphs; ability to reason from traces/counters).

🏖️ Benefits

• Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
• Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
• 401(k) retirement plan with employer match
• Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
• Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
• 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
• Mental health and wellness support
• Employer-paid basic life and disability coverage
• Annual learning and development stipend to fuel your professional growth
• Daily meals in our offices, and meal delivery credits as eligible
• Relocation support for eligible employees
• Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.
