Senior HPC Cluster Engineer

🕒 April 20

Nebius Group

1001 - 5000 employees

🏢 Enterprise

☁ SaaS

AI • Enterprise • SaaS

Nebius Group is building one of the world’s leading AI infrastructure companies, focusing on providing the necessary compute, storage, and tools for developers in the AI space. Based in Europe and listed on Nasdaq, Nebius has a global presence with R&D centers across Europe, North America, and Israel. The company's primary offering is an AI-centric cloud platform designed for intensive AI workloads, complemented by various other businesses involved in generative AI development, edtech, and autonomous technology.

📋 Description

• Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
• Analyzing and troubleshooting the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
• Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
• Enhancing automation systems for proactive monitoring, detecting, and resolving issues in GPU and InfiniBand environments.
• Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.

🎯 Requirements

• 5+ years of professional experience in system-level software development (focused on performance optimization, low-level programming).
• 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
• In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
• Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).

đŸ–ïž Benefits

‱ Competitive salary and comprehensive benefits package. ‱ Opportunities for professional growth within Nebius. ‱ Flexible working arrangements. ‱ A dynamic and collaborative work environment that values initiative and innovation.

Similar Jobs

🕒 April 20

DigitalOcean

1001 - 5000

☁ SaaS

Hardware Sustaining Engineer at DigitalOcean supporting server infrastructure and troubleshooting hardware issues in a cloud capacity. Collaborating with teams to improve operational standards and drive efficiency.

Cloud

Python

🕒 April 20

Founding Engineer role at Hermes Web to develop a polished hosted personal AI agent. Responsibilities include owning product and engineering end-to-end with a focus on consumer-grade polish.

JavaScript

Kubernetes

Next.js

Python

React

TypeScript

🕒 April 18

Ebara Elliott Energy

1001 - 5000

⚡ Energy

🔧 Hardware

Controls Engineer providing technical advice and analyzing system controls for rotating equipment, including programming. Coordinating with customers and ensuring operation per design standards throughout North America.

🕒 April 17

GCON Inc.

51 - 200

🏠 Real Estate

Project Engineer supporting large-scale data center project in West Texas, focusing on project management and field operations while requiring relocation. Ideal for someone early in their construction career.

🕒 April 17

General Dynamics Information Technology

10,000+ employees

🔒 Cybersecurity

🤖 Artificial Intelligence

Wireless Engineer Lead providing technical oversight for engineering and installation across global Air Force and Space Force sites. Responsible for coordinating tasks and ensuring design accuracy in wireless infrastructure.