Senior AI and ML HPC Cluster Engineer

October 19

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and works across cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Products such as NVIDIA Omniverse support high-fidelity simulation and rendering. Its applications span several industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

📋 Description

• Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
• Develop and improve our ecosystem around GPU-accelerated computing, including developing scalable automation solutions
• Build and maintain heterogeneous AI and ML clusters on-premises and in the cloud
• Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet users' evolving needs
• Support our researchers in running their workloads, including performance analysis and optimization
• Conduct root cause analysis and suggest corrective action
• Proactively find and fix issues before they occur

🎯 Requirements

• Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
• 5+ years of experience designing and operating large-scale compute infrastructure
• Experience with advanced AI/HPC job schedulers, such as Slurm, K8s, PBS, RTDA, or LSF
• Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
• Solid understanding of cluster configuration management tools such as Ansible, Puppet, or Salt
• In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, and Charliecloud
• Proficiency in Python programming and bash scripting
• Applied experience with AI/HPC workflows that use MPI
• Experience analyzing and tuning performance for a variety of AI/HPC workloads
• Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields

🏖️ Benefits

• Equity
• Benefits
