Senior Site Reliability Engineer – Observability, Telemetry Platform

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its products are used across industries, including autonomous vehicles (NVIDIA DRIVE), healthcare (NVIDIA Clara), and AI-driven analytics and workflows.

🕒 May 14

🏄 California – Remote

💵 $168k - $270.3k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Cloud

Distributed Systems

Linux

Perl

Python

Ruby

NVIDIA

📋 Description

• Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging, and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
• Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems

🎯 Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 8+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• 5+ years of experience delivering foundational infrastructure and observability platforms
• Experience in one or more of the following: Python, Go, Perl, or Ruby
• In-depth knowledge of Linux, networking, and containers

🏖️ Benefits

• Equity
• Benefits
