Senior Site Reliability Engineer, Observability

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and technology for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Platforms such as NVIDIA Omniverse support high-fidelity simulation and rendering. Its applications span several industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

11 hours ago

🏄 California – Remote

🤠 Texas – Remote

💵 $184k - $287.5k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Distributed Systems

Elasticsearch

Java

Linux

Prometheus

Python

📋 Description

• Architecting and operating large-scale observability systems that span global regions and support AI, data, and platform services.
• Designing resilient pipelines for metrics, logs, traces, profiling, and events that keep critical systems visible and debuggable.
• Working closely with platform, infrastructure, and application teams to establish telemetry standards, instrumentation patterns, and integration workflows.
• Automating deployments, scaling workflows, and maintenance tasks to cut down toil and level up operational maturity across the stack.
• Defining and maintaining SLOs, SLIs, error budgets, dashboards, and alerting models that guide reliability decisions company-wide.
• Building self-service tooling and frameworks that make observability easy to adopt for engineers across NVIDIA.
• Studying real system behavior to uncover bottlenecks, scaling limits, failure modes, and long-term architecture risks.
• Running day-to-day operations including upgrades, performance tuning, break/fix, and rotations that keep the platform healthy.
• Leading incident response and root-cause investigations, then driving the follow-through to eliminate repeat failures.
• Guiding engineers through design reviews, operational best practices, and reliability-focused decision making.

🎯 Requirements

• Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
• 10+ years operating large-scale production systems in roles such as SRE, Production Engineer, or Platform Engineer and 5+ years designing, building, and running observability platforms at scale.
• Deep hands-on experience with open-source observability stacks, including Prometheus/Thanos/Mimir for metrics, Loki or Elasticsearch/OpenSearch for logs, and Tempo/Jaeger/OpenTelemetry for tracing and profiling.
• Strong programming ability in Python and Go, with Java experience considered a plus.
• Solid grounding in Linux internals, networking, storage systems, distributed systems, concurrency, and performance engineering.
• Experience architecting multi-region, multi-tenant telemetry pipelines with high availability and strong durability guarantees.
• Proven skill in optimizing PromQL, LogQL, trace queries, ingestion paths, indexing strategies, and retention policies.
• Strong understanding of SLOs, SLIs, error budgets, incident response, and the operational processes that support reliable systems.
• Ability to analyze complex distributed systems, pinpoint failure modes, and drive data-informed debugging and root cause analysis.
• Clear communicator who can collaborate effectively across product, platform, infrastructure, and application engineering teams.

🏖️ Benefits

• Equity
• Benefits
