AI Infrastructure Engineer

11 - 50 employees

🔧 Hardware

🏢 Enterprise

🤖 Artificial Intelligence

💰 $10M Seed Round in April 2022

Hydra Host provides high-performance computing infrastructure: dedicated bare metal GPU servers optimized for AI and HPC workloads. Its platform lets users rent top-tier GPUs globally, with an emphasis on performance, security, and customization. Hydra Host also runs a marketplace, Brokkr, that offers a wide range of GPU configurations for mission-critical applications such as AI, big data, and machine learning. Customers keep full control over their server environments and can scale capacity as their needs grow. The company's offerings are trusted by leading firms seeking efficient and innovative computing solutions.

🕒 February 10

🇺🇸 United States – Remote

💵 $150k - $225k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

👷 Infrastructure Engineer

Ansible

Cloud

Kubernetes

Linux

PyTorch

TCP/IP

Terraform

📋 Description

• Get AI Platform customers production-ready on Hydra: standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
• Own the bare metal ↔ platform layer: bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and the MLOps tooling that customers actually use.
• Configure, benchmark, and debug NVIDIA driver stacks: firmware versions, CUDA compatibility, NCCL tuning, MIG configurations.
• Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
• Identify gaps before customers do: pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
• Turn customer learnings into product: working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
• Advise customers on chip selection and tokenomics: helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.

🎯 Requirements

• Bare metal Linux depth: you've administered GPU servers at the metal (driver stacks, kernel tuning, firmware, storage configuration), not just managed K8s.
• NVIDIA GPU stack expertise: drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
• Kubernetes and orchestration: production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
• AI networking fundamentals: TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
• Customer-facing communication: you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
• Bias toward scalable solutions: you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.

Nice to Have

• HPC or large-scale distributed training environments.
• AI workload experience (vLLM, PyTorch, inference frameworks).
• Storage systems (NVMe, distributed filesystems, Ceph, WEKA).
• IaC and provisioning tools (Terraform, Ansible, cloud-init, MAAS).

🏖️ Benefits

• Competitive salary
• Equity ownership
• Healthcare: medical, dental, vision for you and your family
• Remote-first, with hubs in Phoenix, Boulder, and Miami
• Direct impact: your work shapes how GPU infrastructure gets deployed across the AI ecosystem
