Network DevOps Engineer, RDMA Fabric Automation

Cloud Computing • Artificial Intelligence

Vultr is a cloud infrastructure provider offering a wide range of services, including compute instances, storage, managed databases, and GPU clusters. The company focuses on high-performance, accessible cloud solutions, using both AMD and NVIDIA technologies to power applications in artificial intelligence, high-performance computing, and general workloads. Vultr designs its services to be simpler and more cost-effective than those of major competitors such as AWS, GCP, and Azure, with global data center locations to support diverse deployment needs.

51 - 200 employees

Founded 2014

🤖 Artificial Intelligence

Yesterday

🇺🇸 United States – Remote

💵 $85k - $100k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

Grafana

Jenkins

Kafka

Linux

PHP

Prometheus

Python

Rust

📋 Description

• Automate deployment and operations of large-scale RDMA (RoCEv2) Ethernet fabrics across Vultr data centers.
• Build Ansible and Python-based frameworks to provision, validate, and remediate underlay and overlay networks.
• Integrate network automation with Vultr’s source-of-truth systems (NetBox, OpsMill) for intent-driven configuration and validation.
• Develop telemetry ingestion and correlation pipelines (gNMI, Prometheus, Kafka, custom collectors) for real-time network health and performance metrics.
• Collaborate with platform, orchestration, and product engineering teams to optimize RDMA performance, PFC/ECN behavior, and path symmetry across fabrics.
• Implement CI/CD workflows for network configuration changes, including validation, pre-checks, and rollbacks.
• Investigate complex network behaviors across layers: flow hashing, congestion domains, ECMP, and overlay interactions.
• Contribute to the design of next-generation GPU and AI interconnect fabrics, ensuring seamless integration into Vultr’s global network architecture.

🎯 Requirements

• Solid understanding of modern data center networking: EVPN-VXLAN, BGP, MLAG, QoS, and traffic engineering.
• Deep familiarity with RoCEv2, RDMA transport tuning, ECN/PFC, and lossless Ethernet design.
• Strong experience with automation frameworks like Ansible, and languages like Python, Golang, Rust, or PHP.
• Comfort working with telemetry and monitoring stacks: Prometheus, Grafana, Loki, ELK, or similar.
• Previous experience integrating with NetBox, Nautobot, OpsMill, or similar for topology and configuration source-of-truth.
• Familiarity with CI/CD systems (GitHub Actions, Jenkins, ArgoCD) for continuous delivery of network automation.
• Strong Linux networking background, including namespaces, netlink, and system-level debugging.

🏖️ Benefits

• 100% company-paid insurance premiums for employee medical, dental, and vision plans.
• 401(k) plan that matches 100% up to 4%, with immediate vesting.
• Professional Development Reimbursement of $2,500 each year.
• 11 Holidays + Paid Time Off Accrual + Rollover Plan.
• Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year.
• $500 stipend for remote office setup in first year + $400 each following year.
• Internet reimbursement up to $75 per month.
• Gym membership reimbursement up to $50 per month.
• Company-paid Wellable subscription.
