Senior Site Reliability Engineer, Core Cloud Engineering

November 5

Vultr

Cloud Computing • Artificial Intelligence

Vultr is a cloud infrastructure provider offering compute instances, storage, managed databases, and GPU clusters. The company focuses on high-performance, accessible cloud solutions, using both AMD and NVIDIA technologies to power applications in artificial intelligence, high-performance computing, and general workloads. Its services are designed to be simpler and more cost-effective than those of major competitors like AWS, GCP, and Azure, with global data center locations to support diverse deployment needs.

51 - 200 employees

Founded 2014

🤖 Artificial Intelligence

📋 Description

• Operate and scale Vultr's control plane, ensuring availability, correctness, and performance across global datacenters.
• Design, implement, and maintain automation to manage hypervisor fleets (KVM, QEMU, libvirt) and supporting infrastructure at scale.
• Develop tooling and automation for Open vSwitch (OVS), BGP routing, and other networking components to ensure resilient and self-healing network operations.
• Continuously analyze and improve system performance across compute, storage, and network layers, with an emphasis on reducing toil and eliminating single points of failure.
• Implement advanced monitoring, logging, and tracing solutions (Grafana, Sentry, SumoLogic) while leading incident response to minimize impact and drive postmortem culture.
• Maintain and evolve infrastructure pipelines (GitLab CI/CD, Puppet) to enable safe, fast, and reliable changes to both control plane and hypervisor infrastructure.
• Work closely with Software Engineers, Network Engineers, and Product teams to align platform reliability with business and user needs.
• Produce clear technical documentation for runbooks, operational procedures, and automation frameworks to improve team efficiency and reliability standards.
• Coach and mentor team members in best practices for site reliability, incident handling, automation, and low-level Linux systems debugging.

🎯 Requirements

• Proficiency in PHP with strong scripting and automation skills.
• Experience running large-scale distributed systems and control plane infrastructure in production.
• Strong background in hypervisor technologies (libvirt, QEMU, KVM) and Linux systems administration.
• Expertise in networking protocols and tools, particularly BGP and Open vSwitch (OVS), with automation experience.
• Deep knowledge of observability and monitoring frameworks (Grafana, Sentry, SumoLogic) and incident management.
• Advanced troubleshooting skills across compute, networking, and storage subsystems.
• Experience building and maintaining CI/CD pipelines (GitLab) and configuration management (Puppet).
• Familiarity with MySQL or similar databases, with an understanding of operational considerations for reliability and scale.
• Strong problem-solving abilities and the drive to tackle complex, low-level reliability challenges.
• Effective cross-team communication and collaboration skills.
• A commitment to continuous improvement and fostering a culture of operational excellence.

🏖️ Benefits

• 100% company-paid insurance premiums for employee medical, dental, and vision plans.
• 401(k) plan that matches 100% up to 4%, with immediate vesting.
• Professional Development Reimbursement of $2,500 each year.
• 11 Holidays + Paid Time Off Accrual + Rollover Plan.
• Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year.
• $500 stipend for remote office setup in first year + $400 each following year.
• Internet reimbursement up to $75 per month.
• Gym membership reimbursement up to $50 per month.
• Company-paid Wellable subscription.
