Reliability Operations Engineer

🔥 8 minutes ago

Serve Robotics

51 - 200 employees

Founded 2017

🚗 Transport

🤖 Artificial Intelligence

💰 $30M Venture Round on 2023-08

Serve Robotics is an innovative company focused on revolutionizing the delivery industry with its autonomous delivery robots. The company aims to make delivery services more affordable, sustainable, and convenient by using self-driving robots instead of traditional two-ton vehicles for small deliveries like burritos. Through a commercial deal with Uber, Serve Robotics plans to deploy up to 2,000 robots, marking a significant advancement in the autonomous delivery sector.

📋 Description

• Lead incident investigations during your region’s daytime hours, providing timely updates, escalating appropriately, and supporting senior engineers leading the response.
• Respond to escalations from Tier 1 support using established runbooks, metrics, logs, and diagnostics to remediate issues or escalate to Tier 3 when needed.
• Update runbooks and operational documentation based on new issues, discoveries, and feedback, ensuring clarity and consistency across all procedures.
• Run existing automations and collaborate with senior team members to enhance tooling and scripts that streamline troubleshooting and remediation tasks.
• Use observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to interpret metrics, logs, and traces, helping identify anomalies and validate system performance.
• Provide concise, accurate updates during incidents, ensuring information reaches the correct engineering and SRE contacts and supporting structured incident coordination.
• Participate in discussions around root causes, share operational insights, and contribute to process improvements that enhance system stability and supportability.
• Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
• Proactively strengthen workflows, adopt best practices, and build the foundation of the Reliability Operations function as it evolves.

🎯 Requirements

• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent hands-on experience.
• 2–4 years of experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support function.
• Experience participating in Tier 1 or Tier 2 investigations, including log review, basic triage, and structured escalation.
• Exposure to operational environments supporting distributed or cloud-based systems.
• Participation in incident response workflows and/or on-call rotations.
• Proficiency with Linux, including navigating systems, reviewing logs, and performing basic diagnostics.
• Experience using and contributing to runbooks and operational workflows.
• Ability to interpret metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.
• Familiarity with cloud platforms, preferably Google Cloud Platform (GCP).
• Ability to follow documented remediation steps, with good judgment around when to escalate.
• Understanding of CI/CD pipelines and how application deployments affect runtime behavior.
• Experience using Jira or similar ticketing systems.
• Clear and effective communicator, especially when providing updates during time-sensitive operational issues.
• Calm, organized approach to troubleshooting and prioritization.
• Collaborative mindset, working effectively with senior operations engineers, product teams, and SREs.
• Strong sense of ownership and accountability for operational responsibilities.

🏖️ Benefits

• Continuous operational coverage
• Weekend on-call rotation shared across the Reliability Operations team
