Manager, Network DevOps

Cloud Computing • Artificial Intelligence

Vultr is a cloud infrastructure provider offering a wide range of services including compute instances, storage, managed databases, and GPU clusters. The company focuses on providing high-performance and accessible cloud solutions, leveraging both AMD and NVIDIA technologies to power applications in artificial intelligence, high-performance computing, and general workloads. Vultr offers services that are designed to be simpler and more cost-effective than major competitors like AWS, GCP, and Azure, with global data center locations to support diverse deployment needs.

51 - 200 employees

Founded 2014

🤖 Artificial Intelligence

Yesterday

🇺🇸 United States – Remote

💵 $140k - $150k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

Cloud

Distributed Systems

Grafana

Kafka

Linux

Prometheus

Python

Rust

Switching

📋 Description

• Own the NetDevOps roadmap, spanning automation, observability, configuration validation, telemetry ingestion, and operational tooling for the global network.
• Manage and grow a high-performing team of NetDevOps Engineers, providing technical guidance, career development, and hands-on mentorship.
• Drive automation for complex environments, including EVPN-VXLAN data center fabrics, RoCEv2 lossless Ethernet, and global WAN/edge infrastructure.
• Build and evolve operator tooling for Network Operations (Tier 1/2), including event correlation, intent validation, playbooks, and automated remediation workflows.
• Ensure operational excellence across fleet-wide updates, config management, CI/CD pipelines, and reliability metrics for automation systems.
• Partner closely with Cloud Networking (who own front-end networking, VPC automation, and dataplane behavior) to unify automation interfaces and ensure clean separation of responsibilities.
• Collaborate with Architecture, Platform, and GPU/AI Engineering on next-generation fabric design, automation hooks, observability, and provisioning flows.
• Standardize telemetry ingestion and correlation pipelines (gNMI, Kafka, Prometheus, custom collectors) to generate actionable, real-time insights into network behavior.
• Lead complex investigations across routing, switching, RDMA transport behavior, congestion, ECMP, and overlay/underlay interactions, especially where tooling or automation must evolve.
• Define engineering standards, SLIs/SLOs for automation services, and operational maturity goals (testing, documentation, failure modes).

🎯 Requirements

• Strong experience building and leading high-performing engineering teams (NetDevOps, SRE, automation, or network engineering groups).
• Deep understanding of modern data center networking: EVPN-VXLAN, BGP, QoS, telemetry, and config automation.
• Familiarity with RoCEv2/RDMA fabrics, PFC/ECN tuning, congestion management, or GPU/AI fabric operations.
• Hands-on experience with automation ecosystems: Ansible, Python, Go, Rust, CI/CD pipelines, config linting, and intent validation frameworks.
• Experience integrating automation with a Source-of-Truth (NetBox, Nautobot, OpsMill, homegrown systems).
• Strong understanding of telemetry and monitoring stacks (Prometheus/Grafana, Kafka, OpenTelemetry, custom collectors).
• Ability to dive deep into Linux networking internals, namespaces, netlink, and distributed systems behavior.
• Proven experience delivering reliable automation services at scale, with strong fundamentals in testing, versioning, rollback, and change management.

🏖️ Benefits

• 100% company-paid insurance premiums for employee medical, dental, and vision plans.
• 401(k) plan that matches 100% up to 4%, with immediate vesting.
• Professional Development Reimbursement of $2,500 each year.
• 11 Holidays + Paid Time Off Accrual + Rollover Plan.
• Commitment matters to Vultr! Increased PTO at the 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year.
• $500 stipend for remote office setup in the first year + $400 each following year.
• Internet reimbursement up to $75 per month.
• Gym membership reimbursement up to $50 per month.
• Company-paid Wellable subscription.
