Staff AI/ML Infrastructure Engineer

🕒 April 14

Vultr

201 - 500 employees

Founded 2014

🤖 Artificial Intelligence

🤝 B2B

🔧 Hardware

🔥 Funding within the last year

💰 $329M debt financing (June 2025)

Vultr is a global cloud infrastructure provider offering on-demand virtual machines, bare-metal servers, GPU-accelerated instances, managed databases, object and block storage, Kubernetes, and networking services. The platform emphasizes AI and HPC workloads with a broad selection of AMD and NVIDIA GPUs, fast networking, and 32+ data center regions, plus a marketplace of deployable apps and developer-friendly APIs. Vultr targets developers and businesses seeking affordable, scalable, and compliant cloud compute and storage alternatives to hyperscalers.

📋 Description

• Design and maintain GPU and bare metal infrastructure in containerized and physical environments
• Build scalable GPU clusters in partnership with networking and provisioning teams
• Ensure reliable, high-performance provisioning of GPU infrastructure
• Develop automated testing systems for GPU-based platforms
• Implement infrastructure solutions for diverse AI/ML workloads
• Benchmark, test, and troubleshoot GPU performance at scale
• Collaborate with hardware vendors on drivers, firmware, and support
• Resolve hardware, software, and performance issues across environments
• Optimize rail and cluster performance across architectures
• Lead technical direction and mentor engineers on infrastructure best practices

🎯 Requirements

• 5+ years of experience working with bare metal infrastructure and hardware automation
• Hands-on experience with modern NVIDIA/AMD GPU platforms and high-performance networking (RoCE, InfiniBand)
• Deep knowledge of BIOS, BMC, firmware, NICs, Redfish/IPMI, and PCIe systems
• Strong Linux systems experience, including device drivers and package management
• Experience building infrastructure automation using Python and Bash
• Familiarity with GPU drivers, firmware ecosystems, and vendor collaboration
• Experience designing and delivering complex infrastructure products
• Proven ability to lead projects and mentor engineers
• Experience optimizing multi-cluster GPU environments
• Exposure to machine learning software stacks and GPU workloads

🏖️ Benefits

• 100% company-paid insurance premiums for employee medical, dental, and vision plans
• 401(k) plan that matches 100% up to 4%, with immediate vesting
• Professional Development Reimbursement of $2,500 each year
• 11 Holidays + Paid Time Off Accrual + Rollover Plan
• Commitment matters to Vultr! Increased PTO at the 3-year and 10-year anniversaries + 1 month of paid sabbatical every 5 years + Anniversary Bonus each year
• $500 stipend for remote office setup in the first year + $400 each following year
• Internet reimbursement up to $75 per month
• Gym membership reimbursement up to $50 per month
• Company-paid Wellable subscription
