AI Infrastructure Engineer

October 31

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

📋 Description

• Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
• Develop and optimize tools to improve infrastructure efficiency and resiliency.
• Root-cause, analyze, and triage failures from the application level to the hardware level.
• Enhance the infrastructure and products underpinning NVIDIA's AI platforms.
• Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
• Define meaningful and actionable reliability metrics to track and improve system and service reliability.
• Apply strong problem-solving, root cause analysis, and optimization skills.

🎯 Requirements

• 12+ years of experience developing software infrastructure for large-scale AI systems.
• Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
• Strong debugging skills and experience analyzing and triaging AI applications from the application level to the hardware level.
• Proven track record of building and scaling large-scale distributed systems.
• Experience with AI training, inference, and data infrastructure services.
• Familiarity with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
• Proficiency in programming languages such as Python and C/C++, as well as scripting languages.

🏖️ Benefits

• Equity
• Benefits

Similar Jobs

October 31

AI Infrastructure Engineer at Speechify designing, developing, and optimizing ML platforms and tools for data scientists and ML engineers. Collaborating to create a self-service ML ecosystem that accelerates innovation.

Airflow

AWS

Azure

Cloud

Distributed Systems

Docker

Google Cloud Platform

Kubernetes

Python

PyTorch

TensorFlow

Terraform

October 31

AI Infrastructure Engineer responsible for building scalable ML infrastructure at Speechify. Collaborating with teams to drive machine learning initiatives in a remote environment.

Airflow

AWS

Azure

Cloud

Distributed Systems

Docker

Google Cloud Platform

Kubernetes

Python

PyTorch

TensorFlow

Terraform

October 27

Principal Architect leading the Developer Platform at SentinelOne. Responsible for designing and driving architecture while ensuring exceptional developer experience.

Ansible

AWS

Chef

Cloud

Docker

Google Cloud Platform

Jenkins

Kubernetes

Puppet

Terraform

October 27

Senior Critical Infrastructure Engineer at Serverfarm, developing consistent colocation design standards for global data centers. Supporting deployment of innovative products in a rapidly growing company.

Cloud

October 25

Senior Infrastructure Engineer for a leading medical institution's technology environment. Designing, implementing, and managing enterprise infrastructure across on-prem, hybrid, and cloud settings.

AWS

Azure

Cloud

VMware
