Systems Software Engineer, Kubernetes Scale

🔥 0 minutes ago

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its products span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions built on NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Drive end-to-end performance and scale characterization for the NVIDIA DGX Cloud software stack
• Collaborate with AI researchers, developers and customers to develop innovative, automated tests
• Deep dive into performance and scale issues in complex distributed systems
• Design and develop monitoring, reporting and analysis tools for performance and scale testing
• Triage, debug and root cause issues related to operating Kubernetes clusters at ultra-large scale
• Build and maintain a high-velocity framework that enables continuous performance and scale testing
• Document research, methodologies and results clearly and concisely
• Engage efficiently with upstream communities

🎯 Requirements

• 2+ years of experience
• Computer Architecture, Networking, Storage systems, Accelerators
• Bachelor's/Master's in Engineering (preferably Electrical Engineering, Computer Engineering, or Computer Science) or equivalent experience
• Expertise in Kubernetes and familiarity with related CNCF projects
• Background in working with large-scale parallel and distributed accelerator-based systems
• Expertise in optimizing performance and AI workloads on large-scale systems
• Experience with performance modeling and benchmarking at scale
• Proficiency in Golang/Python
• Background with the NVIDIA software ecosystem in both training and inference domains
• Expertise with at least one public CSP infrastructure (for example GCP, AWS, Azure, OCI)
