Designing the world's most advanced GPU systems for Deep Learning.
Deep Learning • Machine Learning • Artificial Intelligence
51 - 200
💰 $39.7M Venture Round on 2022-11
March 15
• Remotely install, upgrade, operate and maintain bare-metal Kubernetes clusters (up to thousands of nodes each)
• Handle cluster degradation, recovery and resizing using our fleet management tooling
• Perform out-of-hours on-call response for critical incidents as part of a well-balanced on-call rotation
• Improve our tooling, automation, and processes for daily operations, alerting, and incident response
• Dive into systems at a low level to solve unique cluster problems and write up your findings
• Assist customers with high-level Kubernetes questions and integration with applications, storage and authentication
• Assist with initial cluster build-outs and validation to help identify failed hardware before customer delivery
• Work closely with our HPC Ops and Datacenter Ops teams on issues that require lower-level expertise or cross-functional solutions
• Mentor and assist less-experienced team members
• Have a voice in our product direction and help us think about how to minimize operational costs and complexity
• An experienced operations engineer, SRE, sysadmin or similar with deep knowledge of running Linux clusters and systems
• Very familiar with running on bare metal (including knowledge of BMCs, kernel drivers, PXE, RAID, VLANs, hypervisors)
• A good understanding of containers, virtualisation, and the mechanisms underpinning them
• A good understanding of daily operation, bug-fixing and maintenance of Kubernetes
• Experience in an on-call environment and with incident response
• Ability to perform incident post-mortems and develop procedures and tooling to prevent root causes from recurring
• An excellent ability to learn on the fly and adapt to solve problems
• Able to work either independently with limited direction, or as part of a team
• Able to work with customers during incidents via tickets, live messaging, or as part of a larger call
• Health, dental, and vision coverage for you and your dependents
• Commuter/Work from home stipends
• 401k Plan with 2% company match
• Flexible Paid Time Off Plan that we all actually use
March 15
201 - 500
🇺🇸 United States – Remote
💵 $104.1k - $122.5k / year
💰 Secondary Market on 2015-05
⏰ Full Time
🟡 Mid-level
🟠 Senior
⚙️ Operations
March 14
11 - 50
March 14
March 14
201 - 500
🇺🇸 United States – Remote
💵 $74k - $94k / year
💰 $350M Series D on 2021-12
⏰ Full Time
🟡 Mid-level
🟠 Senior
⚙️ Operations
🗽 H1B Visa Sponsor
March 13
201 - 500
🇺🇸 United States – Remote
💵 $185k - $240k / year
💰 Post-IPO Equity on 2022-11
⏰ Full Time
🟠 Senior
⚙️ Operations
🗽 H1B Visa Sponsor