AI Benchmark Engineer – Native Language Specialist, Spanish

🕒 February 26

🗣️🇪🇸 Spanish Required

LILT AI

201 - 500 employees

Founded 2015

🤖 Artificial Intelligence

☁️ SaaS

🏢 Enterprise

🔥 Funding within the last year

💰 $25M funding round, June 2025

LILT AI is a multilingual AI platform that helps enterprises and public-sector organizations create, translate, verify, and manage content across languages at scale. It combines domain-specific AI models, continuous model training, and a human intelligence layer of professional linguists to deliver secure, brand-consistent localization, translation verification, and multilingual model development. The platform offers enterprise project management, 100+ native integrations, on-prem and air-gapped deployments, and tools to train, evaluate, and govern multilingual AI for use cases like website localization, product launches, technical documentation, and regulatory/compliance communications.

📋 Description

• We are building a rigorous, verifiable evaluation suite of Terminal-Bench tasks designed to test the limits of large language models on multilingual software challenges.
• Our goal is to measure multilingual robustness across prompt language effects, non-English data processing, and complex locale/encoding edge cases in terminal workflows.
• You will create high-signal, high-quality tasks that genuinely test a model's ability to handle multilingual environments without relying on English translation crutches.
• Asset Creation: Build realistic task environments using datasets and files in your native language. Crucially, these assets must remain in the target language to genuinely measure multilingual handling.
• Analyze execution logs and calibrate task difficulty (Easy to Very Hard) using standard Terminal-Bench run configurations against various model tiers (Haiku, Sonnet, Opus).
• Participate in a rigorous, 4-layer human quality control process alongside automated LLM-based checks to ensure fairness, grammatical accuracy, and benchmark integrity.

🎯 Requirements

• 5+ years of industry experience in software engineering.
• Proven track record at leading technology companies and/or graduation from top-tier engineering universities.
• Native or near-native fluency in Spanish, with a deep understanding of its grammar, register, and phrasing rules. High English proficiency.
• Strong proficiency in Python, standard shell scripting, and data processing.
• Extensive experience with Terminal/CLI-based development workflows and a working familiarity with coding agents.
• Deep technical understanding of multilingual text processing pitfalls, including:
  - Encoding/decoding robustness and Unicode normalization.
  - Locale-dependent conventions (collation, casing, non-Gregorian dates).
  - Text I/O, toolchain interoperability, and safe string operations.
  - (For specific languages) Bidirectional/RTL handling, font fallbacks, and rendering/typography in UI or artifacts.

🏖️ Benefits

• Note: this is a remote, freelance opportunity.
• Earn money. Have fun. Advance human knowledge.
• Get paid quickly and fairly, and build your professional network in a supportive community.
