Senior Site Reliability Engineer, BCM – DGX Cloud

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara and AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

October 28

🏄 California – Remote

💵 $168k - $333.5k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Kubernetes

Linux

Python

📋 Description

• Contributing to deployments and daily operations of large-scale, next-generation GPU platforms.
• Handling incidents in GPU clusters, bridging the gap between cluster operations and development.
• Designing and implementing small features in the Base Command Manager product to become intimately familiar with the workings of the product.
• Validating complex cluster configurations, including Slurm and Kubernetes orchestrators, for performance, scalability, and resilience, ensuring they meet real-world customer scenarios.

🎯 Requirements

• Bachelor's Degree or equivalent experience in Computer Science or related field.
• 8+ years of experience in site reliability engineering and/or software development roles.
• Fluency in Python.
• In-depth knowledge of Linux and networking.
• Experience with C++, high-performance computing, Kubernetes and/or system administration would be an asset.
• Previous experience as a system admin running BCM/Bright Cluster Manager/Base Command Manager clusters is a definite plus.
• Proficiency with cluster networking including InfiniBand and Spectrum-X.

🏖️ Benefits

• Equity
• Benefits
