AI Infrastructure Engineer

🕒 February 10

Hydra Host

11 - 50 employees

🔧 Hardware

🏢 Enterprise

🤖 Artificial Intelligence

💰 $10M Seed Round on 2022-04

Hydra Host provides high-performance computing infrastructure, offering dedicated bare metal GPU servers optimized for AI and HPC workloads. Its platform lets users rent top-tier GPUs globally, with an emphasis on performance, security, and customization. Its marketplace, Brokkr, offers a wide range of GPU configurations for mission-critical applications such as AI, big data, and machine learning. Customers keep full control over their server environments and can scale capacity as their needs grow. The company describes its offerings as trusted by leading firms seeking efficient computing solutions.

📋 Description

• Get AI Platform customers production-ready on Hydra: standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
• Own the bare metal ↔ platform layer: bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
• Configure, benchmark, and debug NVIDIA driver stacks: firmware versions, CUDA compatibility, NCCL tuning, MIG configurations.
• Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
• Identify gaps before customers do: pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
• Turn customer learnings into product: working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
• Advise customers on chip selection and tokenomics: helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.

🎯 Requirements

• Bare metal Linux depth: you've administered GPU servers at the metal (driver stacks, kernel tuning, firmware, storage configuration), not just managed K8s.
• NVIDIA GPU stack expertise: drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
• Kubernetes and orchestration: production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
• AI networking fundamentals: TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
• Customer-facing communication: you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
• Bias toward scalable solutions: you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.

Nice to Have

• HPC or large-scale distributed training environments.
• AI workload experience (vLLM, PyTorch, inference frameworks).
• Storage systems (NVMe, distributed filesystems, Ceph, WEKA).
• IaC and provisioning tools (Terraform, Ansible, cloud-init, MAAS).

🏖️ Benefits

• Competitive salary
• Equity ownership
• Healthcare: medical, dental, vision for you and your family
• Remote-first, with hubs in Phoenix, Boulder, and Miami
• Direct impact: your work shapes how GPU infrastructure gets deployed across the AI ecosystem
