Senior Solutions Architect – AI Factory Observability and Visualization

🔥 41 minutes ago

NVIDIA

10,000+ employees

Founded 1993

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and platforms for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Platforms such as NVIDIA Omniverse support high-fidelity simulation and rendering. NVIDIA's products span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

📋 Description

• Run AI factory validation tools, microbenchmarks, and workloads provided by the team
• Gain a comprehensive understanding of the system from start to finish
• Establish what "healthy" represents across the stack
• Build and extend the telemetry surface across hardware, fabric, and workload
• Serve as the observability expert, investigating gaps in visibility
• Develop automation (Python, Shell) for collecting, transforming, and presenting system and network data
• Recommend improvements to system visibility, data sources, and reporting
• Collaborate with hardware, software, networking, datacenter, and product groups

🎯 Requirements

• Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field
• 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings
• Hands-on experience with the architecture of multi-GPU and/or multi-node clusters
• Solid grasp of how HPC and AI factory systems fit together end to end
• Proficiency with Python and Shell/Bash for scripting, automation, and tooling
• Practical experience working with observability systems (e.g., Prometheus, Grafana, Loki, or similar)
• Experience transforming metrics, logs, and traces into clear, actionable insight for complex distributed environments
• Familiarity with GPU and fabric telemetry (e.g., DCGM, NVLink, InfiniBand/Ethernet fabric counters)
• Strong communication skills and the ability to work effectively with cross-functional teams

🏖️ Benefits

• Eligible for equity
• Health insurance
• Professional development opportunities
