Director – AI Infrastructure

2 days ago

Vultr

Cloud Computing • Artificial Intelligence

Vultr is a cloud infrastructure provider offering a wide range of services including compute instances, storage, managed databases, and GPU clusters. The company focuses on providing high-performance and accessible cloud solutions, leveraging both AMD and NVIDIA technologies to power applications in artificial intelligence, high-performance computing, and general workloads. Vultr offers services that are designed to be simpler and more cost-effective than major competitors like AWS, GCP, and Azure, with global data center locations to support diverse deployment needs.

51 - 200 employees

Founded 2014

🤖 Artificial Intelligence

📋 Description

• Lead engineering team responsible for implementation, scaling, and operation of AI compute clusters.
• Partner with senior leadership to shape infrastructure roadmaps and translate high-level strategy into actionable engineering plans.
• Drive execution of cluster deployments, hardware bring-up, node-level configuration, integration and validation with orchestration systems.
• Oversee engineering efforts related to GPU fleet growth, networking design integration, and storage systems that support ML and high-performance workloads.
• Ensure reliability, performance, and availability of the infrastructure through monitoring, automation, and well-defined operational processes.
• Work with AI/ML teams, SRE, networking, and hardware engineering to align cluster capabilities with training and inference requirements.
• Improve provisioning, configuration management, and lifecycle operations for large bare metal and GPU fleets.
• Contribute to the design of multi-tenant scheduling, workload management, and resource orchestration, in partnership with the cluster architect.
• Manage technical incident response and proactively identify areas for performance improvement or architectural refinement.
• Mentor and grow engineering talent, fostering a high-performance, detail-oriented culture.
• Collaborate closely with Product to clarify requirements, delivery timelines, and customer-facing capabilities.

🎯 Requirements

• Extensive experience (typically 8–12 years) in infrastructure engineering, HPC, large-scale systems, or similar domains.
• Strong understanding of AI compute infrastructure, including GPU/CPU clusters, distributed training workflows, and high-performance networking (InfiniBand/RDMA).
• Experience running production bare metal platforms or hardware fleets at meaningful scale.
• Technical depth in Linux systems, Kubernetes or Slurm, provisioning tools (Terraform, Ansible), observability stacks, and networking fundamentals.
• Hands-on experience with cluster operations, hardware bring-up, distributed systems, or ML workload scaling.
• Demonstrated ability to lead and develop engineering teams while staying close to the technical details.
• Excellent cross-functional communication and ability to partner with architecture, AI/ML, SRE, Networking, and Infrastructure Operations teams.
• Strong execution mindset with the ability to translate strategic goals into measurable engineering deliverables.

🏖️ Benefits

• 100% company-paid insurance premiums for employee medical, dental, and vision plans.
• 401(k) plan that matches 100% up to 4%, with immediate vesting.
• Professional Development Reimbursement of $2,500 each year.
• 11 Holidays + Paid Time Off Accrual + Rollover Plan.
• Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year.
• $500 stipend for remote office setup in first year + $400 each following year.
• Internet reimbursement up to $75 per month.
• Gym membership reimbursement up to $50 per month.
• Company-paid Wellable subscription.
