Senior Site Reliability Engineer – Observability, Telemetry Platform

🕒 May 14

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and platforms for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Products such as NVIDIA Omniverse support high-fidelity simulation and rendering. Its applications range from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging, and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
• Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems

🎯 Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 8+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• 5+ years of experience delivering foundational infrastructure and observability platforms
• Experience in one or more of the following: Python, Go, Perl, or Ruby
• In-depth knowledge of Linux, networking, and containers

🏖️ Benefits

• Equity
• Benefits
