Tech Lead – Deployment & Operations, Custom Infrastructure

🕒 May 16

OpenAI

201 - 500 employees

Founded 2015

🤖 Artificial Intelligence

☁️ SaaS

🏢 Enterprise

OpenAI is a leading research organization and company dedicated to creating advanced artificial intelligence technology, with a strong emphasis on safety and ethical considerations. OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. The company develops AI products like ChatGPT, which can assist users with tasks ranging from everyday requests to complex enterprise solutions. OpenAI also provides an API platform that integrates its AI models into various applications. The company is focused on innovation in AI and improving data analysis capabilities, while emphasizing safety and ethical governance of its systems.

📋 Description

• Lead a team responsible for deployment and operations of OpenAI’s custom silicon and systems in data center environments
• Own the path from hardware bring-up and validation through production deployment, operational readiness, and sustained fleet support
• Partner closely with silicon, systems, software, infrastructure, networking, data center, supply chain, and external partner teams to ensure successful deployment at scale
• Define deployment processes, operational playbooks, technical readiness criteria, escalation paths, and reliability practices for new hardware platforms
• Drive cross-functional execution across lab bring-up, rack/system integration, data center deployment, fleet monitoring, debugging, and issue resolution
• Stay hands-on technically through architecture reviews, deployment planning, failure analysis, operational debugging, and critical system-level decision-making
• Identify gaps in tooling, observability, automation, validation coverage, and operational processes, and build plans to close them
• Establish clear metrics for deployment readiness, reliability, performance, maintainability, and operational health
• Build a strong engineering culture grounded in ownership, technical rigor, operational excellence, and high-velocity execution
• Ensure OpenAI’s custom hardware platforms can be deployed and operated reliably, repeatably, and safely at scale
• Be a contributor and technical driver for the architecture and design of future ML systems

🎯 Requirements

• 8+ years of engineering experience in hardware systems, infrastructure, data center deployment, production operations, systems engineering, silicon bring-up, or related technical domains
• Strong technical depth in one or more of: hardware deployment, data center operations, rack-scale systems, silicon bring-up, systems validation, fleet operations, reliability engineering, infrastructure automation, or hardware/software integration
• Experience bringing complex hardware systems from development or validation into production environments
• Experience working closely with silicon, systems, software, infrastructure, networking, or data center teams
• Experience with deployment planning, operational readiness, incident response, debugging, and root-cause analysis for production systems
• Experience building tooling, automation, observability, or operational processes that improve deployment quality and fleet reliability
• Demonstrated ability to hire, develop, and lead senior technical talent
• Ability to move fluidly between people leadership, technical strategy, and hands-on operational problem solving
• Strong written and verbal communication skills, especially in high-urgency, cross-functional technical environments
• Experience working in fast-moving environments

🏖️ Benefits

• Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
• Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
• 401(k) retirement plan with employer match
• Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
• Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
• 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
• Mental health and wellness support
• Employer-paid basic life and disability coverage
• Annual learning and development stipend to fuel your professional growth
• Daily meals in our offices, and meal delivery credits as eligible
• Relocation support for eligible employees
• Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided

Similar Jobs

🕒 May 15

Infinitus Systems, Inc.

11 - 50

🤖 Artificial Intelligence

⚕️ Healthcare Insurance

☁️ SaaS

Backend Software Engineer at Infinitus, designing and scaling AI-powered healthcare solutions. Collaborating with Product, Design, and Engineering teams for end-to-end product development.

🏢🏡 San Francisco – Hybrid

💰 $30M Series B on 2021-11

⏰ Full Time

🟠 Senior

🔴 Lead

🧑‍💻 Full-stack Engineer

AWS

Cloud

Distributed Systems

Google Cloud Platform

Java

NoSQL

Python

React

SQL

TypeScript

Go

🕒 May 15

Anyscale

51 - 200

🤖 Artificial Intelligence

☁️ SaaS

🏢 Enterprise

Software Engineer working on the Ray backend for distributed applications. Leading projects and mentoring junior engineers while improving system performance at Anyscale.

Distributed Systems

Open Source

Ray

🕒 May 14

Concordance Healthcare Solutions

1001 - 5000

🤝 B2B

☁️ SaaS

🏢 Enterprise

Senior Software Engineer developing React web applications for a fintech startup. Leading technical architecture decisions and mentoring junior developers in modern JavaScript frameworks.

🏢🏡 San Francisco – Hybrid

⏰ Full Time

🟡 Mid-level

🟠 Senior

🧑‍💻 Full-stack Engineer

🗣️🇪🇸 Spanish Required

AWS

Azure

Cloud

Google Cloud Platform

JavaScript

React

🕒 May 14

Front

201 - 500

☁️ SaaS

🏢 Enterprise

Senior Software Engineer developing the foundations for a React Native-based mobile app at Front. Leading improvements and performance enhancements for Android and iOS platforms.

Android

iOS

Kotlin

React

React Native

TypeScript

🕒 May 14

Harvey

11 - 50

🤖 Artificial Intelligence

🏢 Enterprise

Software Engineer leading engineering projects at Harvey, which transforms professional services with AI. Collaborate with teams on product lines and build secure, scalable solutions.

🏢🏡 San Francisco – Hybrid

💵 $193.4k - $290k / year

💰 $80.6G Series B on 2023-12

⏰ Full Time

🟠 Senior

🧑‍💻 Full-stack Engineer

🦅 H1B Visa Sponsor

Postgres

Python

React

TypeScript