GPU Cluster Architect

🕒 August 22, 2025

Nebius Group

1001 - 5000 employees

🏢 Enterprise

☁️ SaaS

AI • Enterprise • SaaS

Nebius Group is building one of the world’s leading AI infrastructure companies, focusing on providing the necessary compute, storage, and tools for developers in the AI space. Based in Europe and listed on Nasdaq, Nebius has a global presence with R&D centers across Europe, North America, and Israel. The company's primary offering is an AI-centric cloud platform designed for intensive AI workloads, complemented by various other businesses involved in generative AI development, edtech, and autonomous technology.

📋 Description

• Cluster Design: Architect scalable GPU cluster topologies, including compute nodes, interconnect (InfiniBand, Ethernet), storage, and control planes.
• Performance Modeling: Analyze AI/ML workloads (e.g., LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density.
• Network Architecture: Work with the network architect to design and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at POD and DC scale.
• Storage Integration: Work with storage teams to optimize performance for training datasets, checkpointing, and other use cases.
• Reliability & Monitoring: Analyze signals from monitoring systems to detect flaws in the design.
• Collaboration: Partner with site reliability, networking, storage, and DC engineering teams to operationalize and scale your architecture.

🎯 Requirements

• 5+ years of experience designing clusters.
• Deep understanding of modern GPU architecture (NVIDIA, AMD, etc.).
• Experience with HPC interconnects (InfiniBand & RoCE).
• Solid background in systems architecture, networking, and hardware reliability.
• Experience in scripting for automation and telemetry pipelines (Python, Go, etc.).

🏖️ Benefits

• Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
• 401(k) plan: Up to 4% company match with immediate vesting.
• Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
• Remote work reimbursement: Up to $85/month for mobile and internet.
• Disability & life insurance: Company-paid short-term, long-term and life insurance coverage.
• Competitive salary and comprehensive benefits package.
• Opportunities for professional growth within Nebius.
• Hybrid working arrangements.
• A dynamic and collaborative work environment that values initiative and innovation.
