Lambda

Designing the world's most advanced GPU systems for Deep Learning.

Deep Learning • Machine Learning • Artificial Intelligence

51 - 200

💰 $39.7M Venture Round on 2022-11

Senior HPC Operations Engineer

November 9, 2023

🇺🇸 United States – Remote

💵 $170k - $230k / year

⏰ Full Time

🟠 Senior

⚙️ Operations

🗽 H1B Visa Sponsor

Description

• Remotely provision and manage large-scale HPC clusters for AI workloads (up to many thousands of nodes)
• Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
• Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site
• Provide context and details to an automation team to further automate the deployment process
• Provide clear and detailed requirements back to HPC design team on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
• Contribute to the creation and maintenance of Standard Operating Procedures
• Provide regular and well-communicated updates to project leads throughout each deployment
• Mentor and assist less-experienced team members
• Stay up-to-date on the latest HPC/AI technologies and best practices

Requirements

• 10+ years of experience in managing HPC clusters
• 10+ years of everyday Linux experience
• Strong understanding of HPC architecture (compute, networking, storage)
• Innate attention to detail
• Experience with Bright Cluster Manager or similar cluster management tools
• Expert in configuring and troubleshooting: SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics, Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod environments, Linux-based compute nodes, firmware updates, driver installation, SLURM, Kubernetes, or other job scheduling systems
• Work well under deadlines and structured project plans
• Excellent problem-solving and troubleshooting skills
• Flexibility to travel to our North American data centers as on-site needs arise or as part of training exercises
• Able to work both independently and as part of a team
• Nice to Have:
  • Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
  • Experience with containerization technologies (Docker, Kubernetes)
  • Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)
  • Keen situational awareness in customer situations, employing diplomacy and tact
  • Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

Benefits

• Generous cash & equity compensation
• Health, dental, and vision coverage for you and your dependents
• Commuter/Work from home stipends
• 401k Plan
• Flexible Paid Time Off Plan that we all actually use
