Senior Solutions Architect – AI Factory Deployment

🕒 April 29

NVIDIA

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It develops graphics processing units (GPUs) and platforms for cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. Products such as NVIDIA Omniverse support high-fidelity simulation and rendering. Other applications range from autonomous vehicles built on NVIDIA DRIVE to healthcare solutions built on NVIDIA Clara, along with AI-driven analytics and workflows.

📋 Description

• Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
• Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
• Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
• Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
• Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
• Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.
• Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.
• Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
• Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
• Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.

🎯 Requirements

• Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related field.
• 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.
• Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL.
• Solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and how they are applied in contemporary ML/LLM training.
• Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.
• Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
• Experience with benchmarking (designing, executing, and interpreting performance benchmarks).
• Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads.
• Strong communication skills and the ability to work effectively with cross-functional teams.

🏖️ Benefits

• Eligible for equity and benefits

Similar Jobs

🕒 April 29

Saviynt

501 - 1000

☁️ SaaS

🔒 Cybersecurity

🏢 Enterprise

Drive technical success of Technology and Cloud partnerships at Saviynt, acting as technical advisor for Tech partners. Support revenue-generating initiatives and lead a team of SEs/SAs.

🇺🇸 United States – Remote

💰 $130M Private Equity Round on 2021-09

⏰ Full Time

🟠 Senior

💻 Solutions Engineer

🦅 H1B Visa Sponsor

🕒 April 29

Databricks

1001 - 5000

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Solutions Architect providing technical leadership in big data solutions for customers at Databricks. Collaborating with sales and engineers to implement innovative data strategies.

🇺🇸 United States – Remote

💵 $180k - $247.5k / year

💰 $1.6B Series H on 2021-08

⏰ Full Time

🟡 Mid-level

🟠 Senior

💻 Solutions Engineer

🦅 H1B Visa Sponsor

🕒 April 29

Salt Security

201 - 500

Solutions Engineer partnering with sales team to drive technical aspects of API security sales process. Delivering presentations, building relationships, and demonstrating value for customers.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

💻 Solutions Engineer

🕒 April 29

Flywire

1001 - 5000

💸 Finance

💳 Fintech

AI Marketing Technology Solutions Architect at Flywire, optimizing marketing with AI-driven solutions. Connecting data and enhancing discoverability in AI-driven search environments.

🇺🇸 United States – Remote

💵 $115k - $150k / year

💰 $60M Series F on 2021-03

⏰ Full Time

🟡 Mid-level

🟠 Senior

💻 Solutions Engineer

🦅 H1B Visa Sponsor

🕒 April 29

Flex

11 - 50

💳 Fintech

☁️ SaaS

🤝 B2B

Founding Solutions Architect at Flex working on enterprise deals and technical integration. Collaborating closely with sales and engineering to shape company direction in the health and wellness payment space.

🇺🇸 United States – Remote

💵 $160k - $260k / year

⏰ Full Time

🟠 Senior

🔴 Lead

💻 Solutions Engineer

🦅 H1B Visa Sponsor
