Senior MLOps Engineer, GenAI Framework

November 14

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

📋 Description

• Architect and manage the continuous integration pipelines and release processes for our Generative AI framework and libraries related to Megatron-LM and NeMo Framework.
• Design and implement efficient, scalable DevOps solutions that allow our fast-growing team to release software more frequently while maintaining high quality and maximum performance.
• Work with industry-standard tools (Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, Jira) in hybrid on-premises and cloud environments.
• Assist with cluster operations and system administration (managing servers, team accounts, and clusters).
• Accelerate research and development cycles by automating recurring tasks such as accuracy and performance regression detection.
• Develop new quality control measures, e.g. code analysis, backwards compatibility, and regression testing, while employing and advancing best practices.
• Work closely with DL framework and library teams (CUDA, cuDNN, cuBLAS, and PyTorch) and with other engineering teams within NVIDIA that provide software, testing, and release-related infrastructure.

🎯 Requirements

• BS or MS degree in Computer Science, Computer Architecture, or a related technical field (or equivalent experience) and 3+ years of industry experience in DevOps and infrastructure engineering.
• Strong system-level programming skills in languages like Python and shell scripting.
• Extensive understanding of build/release systems and CI/CD, and experience with solutions like GitLab, GitHub, and Jenkins.
• Experience with Linux system administration.
• Proficiency with containerization and cluster management technologies like Docker and Kubernetes.
• Experience with build tools, including Make and CMake.
• A strong background in source code management (SCM) solutions such as GitLab, GitHub, and Perforce.
• Strong problem-solving and debugging skills.
• Great teammate who can collaborate with and influence others in a dynamic environment.
• Excellent interpersonal and written communication skills.

🏖️ Benefits

• Equity
• Benefits
