Senior Site Reliability Engineer, BCM – DGX Cloud

October 28

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and products for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Platforms such as NVIDIA Omniverse enable high-fidelity simulation and rendering. Its applications span autonomous vehicles using NVIDIA DRIVE, healthcare solutions built on NVIDIA Clara, and AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

📋 Description

• Contributing to deployments and daily operations of large scale next-generation GPU platforms
• Handling incidents in GPU clusters, bridging the gap between cluster operations and development
• Designing and implementing small features in the Base Command Manager product to become intimately familiar with the workings of the product
• Validating complex cluster configurations including Slurm and Kubernetes orchestrators for performance, scalability and resilience, ensuring they meet real-world customer scenarios.

🎯 Requirements

• Bachelor's Degree or equivalent experience in Computer Science or related field.
• 8+ years of experience in site reliability engineering and/or software development roles.
• Fluency in Python
• In-depth knowledge of Linux and networking
• Experience with C++, high-performance computing, Kubernetes and/or system administration would be an asset
• Previous experience as a system admin running BCM/Bright Cluster Manager/Base Command Manager clusters is a definite plus.
• Proficiency with cluster networking including InfiniBand and Spectrum-X

🏖️ Benefits

• equity
• benefits
