Senior AI and ML HPC Cluster Engineer

🕒 April 24

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
• Develop and improve our ecosystem around GPU-accelerated computing, including developing scalable automation solutions
• Build and maintain heterogeneous AI and ML clusters on-premises and in the cloud
• Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs
• Support our researchers in running their workloads, including performance analysis and optimization
• Conduct root cause analysis and suggest corrective action
• Proactively find and fix issues before they occur

🎯 Requirements

• Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
• 5+ years of experience designing and operating large-scale compute infrastructure
• Experience with advanced AI/HPC job schedulers, such as Slurm, K8s, PBS, RTDA, or LSF
• Proficiency in administering CentOS/RHEL and/or Ubuntu Linux distributions
• Solid understanding of cluster configuration management tools such as Ansible, Puppet, and Salt
• In-depth understanding of container technologies such as Docker, Singularity, Podman, Shifter, and Charliecloud
• Proficiency in Python programming and bash scripting
• Applied experience with AI/HPC workflows that use MPI
• Experience analyzing and tuning performance for a variety of AI/HPC workloads

🏖️ Benefits

• Equity
• Health insurance
• Retirement plans
• Paid time off
• Flexible work arrangements
• Professional development

Similar Jobs

🕒 April 24

1Password

501 - 1000

🔒 Cybersecurity

☁️ SaaS

⚡ Productivity

AI Engineer implementing scalable AI solutions for Customer Experience at 1Password. Collaborating to enhance workflows and drive operational efficiency with automation.

🕒 April 24

Alteryx

1001 - 5000

🤖 Artificial Intelligence

🤝 B2B

AI Operations Lead driving AI transformation strategy within Marketing at Alteryx. Responsible for architecting rapid AI solutions and fostering cross-functional collaboration.

🕒 April 24

EnvisionWare, Inc.

51 - 200

📚 Education

☁️ SaaS

🔧 Hardware

AI-Directed Engineer developing and delivering software by directing AI tools. Focus on creating production-quality code using AI to enhance workflows.

🕒 April 23

Montauk Capital

1 - 10

💸 Finance

⚡ Energy

☁️ SaaS

Founding CEO for AI Economics OS at Montauk Capital. Leading product, team, and commercial strategy to create a market standard in AI economics.

🕒 April 23

Armada

51 - 200

📡 Telecommunications

🤖 Artificial Intelligence

🏢 Enterprise

Value Engineer at Armada quantifying the economic narratives of AI compute in infrastructure. Working with financial data and customer insights to drive value for stakeholders.