Senior Site Reliability Engineer

September 2


NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and related platforms for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Products such as NVIDIA Omniverse support high-fidelity simulation and rendering. Its applications range from autonomous vehicles using NVIDIA DRIVE to healthcare solutions built on NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

📋 Description

• Work on NVIDIA DGX Cloud, which delivers a fully managed AI platform on major cloud providers
• Build, implement, and support operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
• Define SLOs/SLIs, monitor error budgets, and streamline reporting
• Support services before launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
• Maintain services once live by measuring and monitoring availability, latency, and overall system health
• Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
• Scale systems sustainably through automation and evolve systems to improve reliability and velocity
• Lead triage and root-cause analysis of high-severity incidents and perform blameless postmortems
• Participate in on-call rotation to support production services

🎯 Requirements

• BS in Computer Science or related technical field, or equivalent experience
• 10+ years of experience operating production services
• Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
• Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
• Proficiency in at least one high-level programming language (e.g., Python, Go)
• In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
• Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling
• Experience building and operating comprehensive observability stacks (OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.)
• Experience operating GPU workloads and GPU-accelerated clusters (KubeVirt experience is a plus)
