Senior Site Reliability Engineer, Observability and Telemetry Platform

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles built on NVIDIA DRIVE and healthcare solutions built on NVIDIA Clara to AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Job not on LinkedIn

August 22

🏄 California – Remote

💵 $168k - $333.5k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Cloud

Distributed Systems

Docker

Grafana

Kubernetes

Linux

Open Source

OpenStack

Perl

Prometheus

Python

Ruby

NVIDIA

📋 Description

• Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging, and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
• Scale systems sustainably through mechanisms such as automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Participate in an on-call rotation supporting production systems

🎯 Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• 8+ years of experience delivering foundational infrastructure and observability platforms
• Experience in one or more of the following: Python, Go, Perl, or Ruby
• In-depth knowledge of Linux, networking, and containers

🏖️ Benefits

• Equity and benefits

Similar Jobs

Salesforce DevOps Architect, ML Operations

August 20

Gov Services Hub

51 - 200

🏛️ Government

🔒 Cybersecurity

🎯 Recruiter

Salesforce DevOps Architect providing leadership for multiple Salesforce teams, managing CI/CD pipelines, and enforcing development standards in a remote role.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

Site Reliability Engineer

August 20

TensorWave

11 - 50

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Senior SRE building scalable, secure infrastructure for AI compute at TensorWave; designs low-level systems and automates infrastructure.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

JavaScript

Kubernetes

Linux

Rust

Spring

Terraform

Senior Deployment Engineer

August 20

Atolio

11 - 50

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Deployment Engineer at Atolio: ensure secure, scalable deployments of enterprise search across environments; build automation and collaborate with success teams.

🇺🇸 United States – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Cloud

Distributed Systems

Google Cloud Platform

Grafana

Kubernetes

Python

ServiceNow

Splunk

Terraform

Sr. Engineer, DevOps

August 19

Syniti

1001 - 5000

🤝 B2B

🏢 Enterprise

Senior DevOps Engineer at Syniti builds CI/CD pipelines and cloud automation, mentors teams, and optimizes DevOps practices for a scalable data platform.

🇺🇸 United States – Remote

💰 Private Equity Round on 2017-08

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Cloud

Docker

Jenkins

Kubernetes

Python

Terraform

Sr. Manager, SRE

August 19

Syniti

1001 - 5000

🤝 B2B

🏢 Enterprise

Lead a global SRE team at Syniti, ensuring compliant, scalable SaaS platforms; drive IaC, observability, and security across AWS, Azure, and on-prem environments. Mentor engineers and align with zero-trust principles.

🇺🇸 United States – Remote

💰 Private Equity Round on 2017-08

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Azure

Cloud

Kubernetes

Python

Terraform