Senior Solutions Architect - Cloud Infrastructure and DevOps

May 12

🗣️🇨🇳 Chinese Required


NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

📋 Description

• Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
• Manage Linux job/workload schedulers and orchestration tools.
• Develop and maintain continuous integration and delivery pipelines.
• Develop tooling to automate the deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service consumption of resources.
• Deploy monitoring solutions for servers, network, and storage.
• Troubleshoot from the bottom up, across bare metal, operating system, software stack, and application levels.
• Serve as a technical resource: develop, redefine, and document standard methodologies to share with internal teams.
• Support Research & Development activities and engage in POCs/POVs for future improvements.

🎯 Requirements

• BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
• At least 8 years of professional experience in networking fundamentals, TCP/IP stack, and data center architecture.
• Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
• Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments.
• Experience in installing and managing HPC clusters, including deployment, optimization, and troubleshooting.
• Experience with job scheduling and workload orchestration technologies such as Slurm, Kubernetes, and Singularity.
• Excellent knowledge of Windows and Linux systems (Red Hat/CentOS and Ubuntu), including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS.
• Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS.
• Familiarity with newer and emerging storage technologies is a plus.
• Proficiency in Python programming and bash scripting.
• Knowledge of CI/CD pipelines for software deployment and automation.
• Comfortable with automation and configuration management tools, including Jenkins, Ansible, and Puppet/Chef.
• Ability to communicate technical concepts and collaborate effectively with Mandarin-speaking customers.

