Senior Prompt and Benchmark Engineer, Evaluation of World Models

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span various industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

November 18

🏄 California – Remote

💵 $184k - $356.5k / year

⏰ Full Time

🟠 Senior

👷🏻‍♀️ Engineer

🦅 H1B Visa Sponsor

📋 Description

• Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially generation and understanding world models that reason about video, simulation, and physical environments.
• Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.
• Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows.
• Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus.
• Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption.
• Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases.
• Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability.
• Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats.
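
To make the ensemble-evaluation and structured-format duties above more concrete, here is a minimal Python sketch of majority voting over multiple-choice answers. It is an illustration only, not NVIDIA's tooling: the `MultipleChoiceItem` class, the `majority_vote` helper, the sample question, and the model answers are all hypothetical, and a real pipeline would obtain the answers by querying actual VLMs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical structured benchmark item: prompt, options, expected answer."""
    prompt: str
    options: Dict[str, str]
    expected: str


def majority_vote(answers: List[str]) -> Tuple[Optional[str], float]:
    """Return (winning answer, agreement ratio).

    The winner is None when there are no answers or the top count is tied.
    """
    if not answers:
        return None, 0.0
    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    agreement = top_count / len(answers)
    # A tie for first place means there is no majority answer to report.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement
    return top_answer, agreement


# Hypothetical item and hypothetical answers from three different VLMs.
item = MultipleChoiceItem(
    prompt="A ball rolls off the edge of a table. Which way does it move next?",
    options={"A": "Upward", "B": "Downward", "C": "It stays in place"},
    expected="B",
)
model_answers = ["B", "B", "C"]

winner, agreement = majority_vote(model_answers)
is_correct = winner == item.expected
print(f"consensus={winner} agreement={agreement:.2f} correct={is_correct}")
```

Treating a tie as "no consensus" rather than picking arbitrarily is one possible design choice; it lets such items be routed to human annotators for review.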

🎯 Requirements

• 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.
• BS, MS, or equivalent background.
• Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families.
• Strong attention to detail in designing natural language questions and formatting structured evaluations.
• Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.
• Excellent communication and collaboration skills: you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design.
• A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows.

🏖️ Benefits

• Equity
• Benefits
