Senior Solutions Architect – AI Factory Observability and Visualization

🔥 2 minutes ago

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. It pioneers advances in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Its innovations, such as NVIDIA Omniverse, transform traditional digital workflows by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles built on NVIDIA DRIVE to healthcare solutions built on NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Run AI factory validation tools, microbenchmarks, and workloads provided by the team
• Gain a comprehensive understanding of the system from start to finish
• Establish what "healthy" represents across the stack
• Build and extend the telemetry surface across hardware, fabric, and workload
• Serve as the observability expert, investigating gaps in visibility
• Develop automation (Python, Shell) for collecting, transforming, and presenting system and network data
• Recommend improvements to system visibility, data sources, and reporting
• Collaborate with hardware, software, networking, datacenter, and product groups

🎯 Requirements

• Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field
• 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings
• Hands-on experience with the architecture of multi-GPU and/or multi-node clusters
• Solid grasp of how HPC and AI factory systems fit together end to end
• Proficiency with Python and Shell/Bash for scripting, automation, and tooling
• Practical experience working with observability systems (e.g., Prometheus, Grafana, Loki, or similar)
• Experience transforming metrics, logs, and traces into clear, actionable insight for complex distributed environments
• Familiarity with GPU and fabric telemetry (e.g., DCGM, NVLink, InfiniBand/Ethernet fabric counters)
• Strong communication skills and the ability to work effectively with cross-functional teams

🏖️ Benefits

• Eligible for equity
• Health insurance
• Professional development opportunities
