HPC Operations Engineer

November 9, 2023

Lambda

Designing the world's most advanced GPU systems for Deep Learning.

Deep Learning • Machine Learning • Artificial Intelligence

51 - 200 employees

💰 $39.7M Venture Round on 2022-11

Description

• Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
• Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
• Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site
• Provide context and details to an automation team to further automate the deployment process
• Provide clear and detailed requirements back to HPC design team on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
• Contribute to the creation and maintenance of Standard Operating Procedures
• Provide regular and well-communicated updates to project leads throughout each deployment
• Stay up-to-date on the latest HPC/AI technologies and best practices

Requirements

• Good understanding of HPC/AI architecture, operating systems, firmware, software, and networking
• 3+ years of experience in deploying and configuring HPC clusters for AI workloads
• Innate attention to detail
• Familiarity with Bright Cluster Manager or similar cluster management tools
• Experience in configuring and troubleshooting: SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics, Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod environments, Linux-based compute nodes, firmware updates, driver installation, and SLURM, Kubernetes, or other job scheduling systems
• Ability to work well under deadlines and structured project plans, while knowing when and how to ask for changes to project timelines
• Solid problem-solving and troubleshooting skills
• Flexibility to travel to North American data centers when needed
• Ability to work independently and as part of a team
• Nice to Have: experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf); experience with containerization technologies (Docker, Kubernetes); experience working with GPU acceleration, virtualization, and cloud computing technologies
• Situational awareness in customer situations, employing diplomacy and tact
• Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

Benefits

• Generous cash & equity compensation
• Health, dental, and vision coverage for you and your dependents
• Commuter/Work from home stipends
• 401k Plan
• Flexible Paid Time Off Plan
• Salary Range: $120,000-$160,000, depending on qualifications
