Infrastructure Engineer, Observability

August 23

Voltage Park

Artificial Intelligence • Enterprise

Voltage Park provides high-performance AI compute infrastructure. The company offers bare-metal access, transparent pricing, and exceptional customer service for demanding workloads that rely on advanced hardware such as NVIDIA HGX H100 GPUs and state-of-the-art data centers. It is committed to delivering fast, flexible, and scalable compute for AI training, model fine-tuning, and real-time inference. Security and compliance are paramount, with top-tier firewalls and rigorous security protocols in place. The infrastructure is designed for reliability, using high-speed networks and advanced data centers to deliver strong performance and support for customers.

📋 Description

• Design and operate systems managing thousands of bare-metal servers, GPUs, and high-performance networks across multiple data centers
• Design, build, and maintain observability platforms spanning metrics, logs, traces, and events
• Create dashboards and alerting for internal stakeholders and scoped visibility for external customers
• Ingest and correlate telemetry from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
• Implement noise-resistant alerting pipelines that improve detection and reduce operational load
• Collaborate with infrastructure, platform, and customer-facing teams to embed observability into workflows
• Contribute to broader infrastructure engineering projects beyond observability
• Fully remote position requiring candidates to be based in the continental United States; no visa sponsorship

🎯 Requirements

• 8+ years in infrastructure engineering, SRE, or observability roles
• Strong experience with monitoring systems (Prometheus, Grafana, ELK, VictoriaMetrics, or similar)
• Proficiency in Python, Go, or Bash for automation and data integration
• Familiarity with container/Kubernetes observability
• Understanding of streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent)
• Strong written and verbal communication skills
• Experience with GPU observability, particularly NVIDIA DCGM (ideal)
• Designing multi-tenant observability solutions with RBAC and scoped queries (ideal)
• Prior work with correlation engines for RCA, forecasting, or predictive alerting (ideal)
• Broader exposure to infrastructure domains (networking, storage, provisioning) (ideal)

🏖️ Benefits

• Offers Equity
• Offers Bonus
• Full benefits
• 5% 401k match
• Comprehensive health insurance with 100% of premiums covered by Voltage Park
