Senior Site Reliability Engineer

September 2

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara and AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

📋 Description

• NVIDIA DGX Cloud delivers a fully managed AI platform on major cloud providers
• Build, implement, and support operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
• Define SLOs/SLIs, monitor error budgets, and streamline reporting
• Support services before launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
• Maintain services once live by measuring and monitoring availability, latency, and overall system health
• Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
• Scale systems sustainably through automation and evolve systems to improve reliability and velocity
• Lead triage and root-cause analysis of high-severity incidents, and conduct blameless postmortems
• Participate in the on-call rotation to support production services

🎯 Requirements

• BS in Computer Science or related technical field, or equivalent experience
• 10+ years of experience operating production services
• Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
• Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
• Proficiency in at least one high-level programming language (e.g., Python, Go)
• In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
• Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling
• Experience building and operating comprehensive observability stacks (OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.)
• Experience operating GPU workloads and GPU-accelerated clusters (KubeVirt experience is a plus)
