Senior Site Reliability Engineer, Observability and Telemetry Platform

August 22

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulations and rendering tasks. Its applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

📋 Description

• Design, implement, and support the operational and reliability aspects of a large-scale Observability & Telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging, and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems

🎯 Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• 8+ years of experience delivering foundational infrastructure and observability platforms
• Experience in one or more of the following: Python, Go, Perl, or Ruby
• In-depth knowledge of Linux, networking, and containers

🏖️ Benefits

• Equity and benefits
