Distinguished Site Reliability Engineer – Cloud

🕒 May 4

Apply Now
Find Similar Remote Jobs

📊 Check your resume score for this job

Improve your chances of getting an interview by checking your resume score before you apply.

Logo of NVIDIA

NVIDIA

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphical processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulations and rendering tasks. Their applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, and AI-driven analytics and workflows.

📋 Description

• Lead, design, implement and support operational and reliability aspects of large scale Kubernetes clusters with focus on performance at scale, real time monitoring, logging and alerting • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews • Maintain services once they are live by measuring and monitoring availability, latency and overall system health • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity • Practice sustainable incident response and blameless postmortems • Be part of an on call rotation to support production systems

🎯 Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience • 16+ years of experience with Infrastructure automation, distributed systems design, experience with design, develop tools for running large scale private or public cloud system in Production • Experience in one or more of the following: Python, Go, Perl or Ruby • In depth knowledge on Linux, Networking and Containers

🏖️ Benefits

• equity • benefits

Apply Now

Similar Jobs

🕒 May 3

1Password

501 - 1000

🔒 Cybersecurity

☁️ SaaS

⚡ Productivity

Staff Security Engineer leading DevSecOps within Corporate Security team at 1Password. Responsible for securing developer environments and overseeing GitHub security.

🇺🇸 United States – Remote

💵 $192k - $278k / year

💰 $620M Series C on 2022-01

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

info

🕒 May 2

Ad Hoc LLC

501 - 1000

🏛️ Government

🤖 Artificial Intelligence

🔌 API

DevOps Engineer III at Ad Hoc enhancing digital services for Veterans Affairs. Collaborating on cloud infrastructure, CI/CD processes, and simplifying DevOps practices.

🇺🇸 United States – Remote

💵 $100k - $104k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🕒 May 2

Ad Hoc LLC

501 - 1000

🏛️ Government

🤖 Artificial Intelligence

🔌 API

Staff DevOps Engineer responsible for leading and improving cloud infrastructure for VA services. Collaborating with stakeholders and mentoring team members in software engineering best practices.

🇺🇸 United States – Remote

💵 $120k - $135k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🕒 May 2

National Resident Matching Program® (NRMP®)

11 - 50

📚 Education

⚕️ Healthcare Insurance

Manager, DevOps responsible for software delivery practices and cloud platform oversight at NRMP. Leading release management and cross-functional team coordination in a complex environment.

🇺🇸 United States – Remote

💵 $157.6k - $173.7k / year

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🕒 May 1

Database Reliability Engineer at Nodal Exchange ensuring PostgreSQL infrastructure supports critical trading operations. Responsible for overall database performance, reliability, and strategy for a financial marketplace.