AI Cluster Architect


🕒 February 25



Vultr

201 - 500 employees

Founded 2014

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

🔥 Funding within the last year

💰 $329M Debt Financing - Vultr on 2025-06


Vultr is a global cloud infrastructure provider offering on-demand virtual machines, bare-metal servers, GPU-accelerated instances, managed databases, object and block storage, Kubernetes, and networking services. The platform emphasizes AI and HPC workloads with a broad selection of AMD and NVIDIA GPUs, fast networking, and 32+ data center regions, plus a marketplace of deployable apps and developer-friendly APIs. Vultr targets developers and businesses seeking affordable, scalable, and compliant cloud compute and storage alternatives to hyperscalers.

📋 Description

• Architect large-scale GPU clusters within fixed site power budgets, optimizing for maximum GPU density while reserving the necessary headroom for compute services, storage, and networking.
• Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, storage, and facility limits).
• Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, Spectrum-X) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
• Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
• Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
• Develop power-aware cluster configuration templates and capacity-planning models that scale across sites with varying constraints and allow for quick iteration and ideation.
• Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
• Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
• Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).

🎯 Requirements

• 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
• Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
• Strong familiarity with InfiniBand, RoCE, and Spectrum-X networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
• Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
• Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding the impact on workload performance.
• Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
• Experience balancing customer-driven requirements for compute, storage, and service density in combination with overall GPU count.
• Strong documentation, communication, and cross-functional collaboration skills.

🏖️ Benefits

• Excellent medical benefits with 100% company-paid premiums for the employee-only plan, plus 100% company-paid dental and vision premiums
• 401(k) plan that matches 100% up to 4% with immediate vesting
• Professional development reimbursement of $2,500 each year
• 11 holidays + paid time off accrual + rollover plan + your birthday off
• Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + anniversary bonus each year
• $500 first-year remote office setup + $400 each following year for new equipment
• Internet reimbursement up to $75 per month
• Gym membership reimbursement up to $50 per month
• Company-paid Wellable subscription

