Senior Prompt and Benchmark Engineer, Evaluation of World Models

November 18

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a technology company specializing in accelerated computing and artificial intelligence. It pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's platforms, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara, as well as AI-driven analytics and workflows.

10,000+ employees

Founded 1993

📋 Description

• Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially generation and understanding world models that reason about video, simulation, and physical environments.
• Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.
• Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows.
• Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus.
• Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption.
• Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases.
• Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability.
• Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats.
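
For readers unfamiliar with the ensemble evaluation mentioned above, here is a minimal sketch of majority voting over multiple-choice answers from several models. It is illustrative only and not NVIDIA's method: the `BenchmarkItem` structure, the `majority_vote` helper, and the sample question and answers are all hypothetical, and no real model is called.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    """Hypothetical structured record for one multiple-choice question."""
    question: str
    choices: dict[str, str]
    expected: str  # key of the correct choice


def majority_vote(answers):
    """Return (winner, agreement); winner is None for an empty list or a tie."""
    if not answers:
        return None, 0.0
    # Normalize so that "b" and " B " count as the same answer.
    counts = Counter(a.strip().upper() for a in answers)
    ranked = counts.most_common()
    top, top_n = ranked[0]
    agreement = top_n / len(answers)
    if len(ranked) > 1 and ranked[1][1] == top_n:
        return None, agreement  # no strict plurality
    return top, agreement


# Made-up example item and made-up answers standing in for three VLM outputs.
item = BenchmarkItem(
    question="After the ball rolls off the table, which way does it move?",
    choices={"A": "Up", "B": "Down", "C": "It stays in place"},
    expected="B",
)
model_answers = ["B", "b", "A"]

winner, agreement = majority_vote(model_answers)
is_correct = winner == item.expected
```

Returning the agreement ratio alongside the winner is one possible design choice: it lets low-consensus items be routed to human annotators instead of being scored automatically.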

🎯 Requirements

• 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.
• BS, MS, or equivalent background.
• Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families.
• Strong attention to detail in designing natural language questions and formatting structured evaluations.
• Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.
• Excellent communication and collaboration skills: you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design.
• A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows.

🏖️ Benefits

• Equity
• Benefits
