Senior Storage Production Engineer – DGX Cloud

🕒 June 15

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and platforms for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Products such as NVIDIA Omniverse enable high-fidelity simulation and rendering. Its platforms are used across industries, from autonomous vehicles built on NVIDIA DRIVE to healthcare solutions built on NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
• Develop and maintain storage monitoring, logging, and alerting systems to ensure proactive detection and resolution of performance issues.
• Work with AI/ML workloads to improve storage architectures for low-latency access, efficient caching, and high-throughput performance.
• Improve the lifecycle of storage services – from inception and design to deployment, operation, and continuous optimization.
• Support storage services before they become available through activities such as system build consulting, developing automation frameworks, capacity management, and launch reviews.
• Maintain production storage infrastructure by supervising availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
• Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
• Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
• Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
• Practice sustainable incident response and blameless root cause analysis.
• Be part of an on-call rotation to support storage and production systems.

🎯 Requirements

• BS degree or equivalent experience in Computer Science, Storage Systems, or a related technical field with 8+ years of practical experience.
• Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
• Solid understanding of block, file, and object storage technologies, including their scalability, reliability, and performance characteristics and standard processes.
• Experience with storage networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
• Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
• Experience in one or more of the following: C/C++, Java, Python, Go, NodeJS, and Bash for storage automation, monitoring, and performance tuning.
• Hands-on experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
• Experience with observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic stack for monitoring storage system health.

🏖️ Benefits

• Equity
• Benefits

