Senior Reliability Operations Engineer

Serve Robotics

51 - 200 employees

Founded 2017

🚗 Transport

🤖 Artificial Intelligence

💰 $30M Venture Round on 2023-08

Serve Robotics is an innovative company focused on revolutionizing the delivery industry with its autonomous delivery robots. The company aims to make delivery services more affordable, sustainable, and convenient by using self-driving robots instead of traditional two-ton vehicles for small deliveries like burritos. Through a commercial deal with Uber, Serve Robotics plans to deploy up to 2,000 robots, marking a significant advancement in the autonomous delivery sector.

📋 Description

• Serve as the primary incident lead during your region’s daytime hours, coordinating technical investigations, centralizing communication, and engaging the appropriate engineering and SRE teams when escalation is required.
• Respond to escalations from Tier 1 support, using runbooks, metrics, logs, and system diagnostics to investigate and remediate issues or determine when escalation to Tier 3 is necessary.
• Develop and update runbooks, workflows, and operational documentation to ensure consistent and reliable responses to recurring issues, collaborating with product teams to expand coverage over time.
• Write, maintain, and enhance automation scripts and tools that streamline common remediation steps, improve response times, and reduce manual operational overhead.
• Use metrics, logs, and tracing tools (Grafana/Prometheus, GCP Monitoring, OpenTelemetry) to proactively identify problems, validate system behavior, and support continuous improvement of detection mechanisms.
• Act as the central point of communication during active incidents, ensuring timely updates and clear routing to the correct product engineering and SRE stakeholders.
• Collaborate with reliability and product teams to share insights, recommend improvements, and help refine processes that enhance the stability and operability of our systems.
• Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
• Help establish operational best practices, refine workflows, and prepare the foundation for a broader reliability operations function.

🎯 Requirements

• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience.
• 5+ years of professional experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support function.
• Demonstrated experience owning or participating in Tier 2 or Tier 3 technical investigations, including triage, log analysis, and structured escalation.
• Experience supporting distributed systems, cloud-hosted services, or production operational environments.
• Hands-on experience participating in incident response processes.
• Strong proficiency with Linux, including navigating systems, reviewing logs, and performing diagnostics.
• Experience writing, executing, and maintaining runbooks, automations, and operational workflows.
• Ability to interpret metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.
• Familiarity with modern cloud environments, preferably Google Cloud Platform (GCP), including basic debugging, permissions, and service-level triage.
• Ability to investigate and remediate issues following documented procedures, escalating effectively when needed.
• Understanding of CI/CD pipelines, deployed application behavior, and operational dependencies across microservices.
• Proficiency with Jira or similar platforms for ticketing and structured incident tracking.
• Exceptional communication skills, especially during high-pressure incidents where clear, concise updates are critical.
• Calm and methodical approach to troubleshooting, prioritization, and decision-making.
• Strong collaboration skills when coordinating with product engineering, SRE, and global support teams.
• High level of ownership, reliability, and accountability when handling operational responsibilities and incident leadership.

Similar Jobs

🕒 2 days ago

Pave Bank

51 - 200

Site Reliability Engineer ensuring high availability and performance of production systems at Pave Bank. Collaborating with teams for infrastructure reliability in a fintech environment.

🇲🇾 Malaysia – Remote

🔥 Funding within the last year

💰 $39M Series A - Pave Bank on 2025-10

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

Distributed Systems

Docker

Google Cloud Platform

Grafana

Kubernetes

Microservices

Prometheus

Python

Terraform

Go

🕒 5 days ago

pod network

1 - 10

🌐 Web 3

Site Reliability Engineer improving and scaling the reliability of the Pod platform, focusing on incident response and operational tooling.

Cloud

Distributed Systems

Docker

Grafana

Linux

Prometheus

Python

Rust

🕒 June 11

Unit4

1001 - 5000

🏢 Enterprise

☁️ SaaS

🤖 Artificial Intelligence

Cloud Operations Engineer at Unit4 solving customer business processing issues and building better solutions with skills in Azure, DevOps, and troubleshooting.

Azure

Cloud

SMTP

SQL

🕒 April 24

LineTen

51 - 200

🛍️ eCommerce

☁️ SaaS

🚗 Transport

Site Reliability Engineer joining LineTen to ensure global coverage of our products. Responsible for engineering support and development experience using Docker and Kubernetes.

Cloud

Docker

Kubernetes

🕒 April 15

Arize AI

51 - 200

🤖 Artificial Intelligence

☁️ SaaS

🏢 Enterprise

Senior DevOps Engineer optimizing infrastructure for SaaS and on-prem AI services at Arize. Collaborates with customers and product teams to enhance performance and reliability.

AWS

Azure

Cloud

Google Cloud Platform

Kubernetes