AI/ML Evaluation Engineer

SaaS • B2B • Enterprise

Truelogic Software is a nearshore software development company specializing in agile staff augmentation services. They focus on providing custom outsourced software development with a team of highly skilled engineers from Latin America. Truelogic Software partners with both startups and Fortune 500 companies, offering solutions that align with their clients' time zones and ensuring high-quality outcomes through collaboration and responsiveness. With a presence in over 25 countries, Truelogic emphasizes remote work for better quality of life, and their engineers are experienced in various industries, delivering a wide range of successful projects globally.

501 - 1000 employees

Founded 2004

November 20

🇦🇷 Argentina – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

🤖 Artificial Intelligence

📋 Description

• Write Python and SQL scripts to evaluate outputs from large language models (LLMs).
• Design and implement LLM-as-Judge evaluations with clear scoring rubrics (faithfulness, relevance, completeness, correctness).
• Define and calculate metrics such as exact match, token-level F1, ROUGE, cosine similarity, and subjective rubric scores.
• Build and maintain ground-truth datasets for benchmarking and regression testing.
• Automate evaluation workflows and integrate them into CI/CD pipelines.
• Analyze large unstructured datasets to identify inconsistencies, anomalies, biases, and missing values.
• Diagnose failure modes such as hallucinations, irrelevant answers, and formatting issues.
• Produce clear reports summarizing evaluation findings and quality trends.
• Collaborate with AI engineers, QA, data scientists, and product managers to define quality standards and release criteria.
• Document all processes, evaluation setups, specifications, and architecture diagrams.
• Maintain reproducibility and traceability for all evaluation runs and datasets.

🎯 Requirements

• Advanced Python skills, including writing, debugging, and automating scripts.
• Strong SQL proficiency and experience manipulating large datasets.
• Hands-on experience with Python libraries such as Pandas and NumPy.
• Ability to clean, standardize, and analyze structured and unstructured data.
• Experience inspecting datasets, visualizing distributions, and preparing data for analysis.
• Solid understanding of large language models, prompt behavior, hallucinations, and grounding concepts.
• Knowledge of retrieval-augmented generation (RAG) flows and embedding-based search.
• Awareness of vector similarity concepts such as cosine similarity and dot product.
• Experience with at least one LLM evaluation framework (RAGAS, TruLens, LangSmith, etc.) or ability to quickly learn one.
• Ability to design or implement custom LLM-as-Judge evaluation systems.
• Applied understanding of statistical concepts such as variance, confidence intervals, precision/recall, and correlation.
• Ability to translate ambiguous quality expectations into measurable metrics.
• Familiarity with cloud-run services and automation pipelines, preferably on Google Cloud Platform (GCP).
• Ability to learn new infrastructure tools quickly.
• Strong analytical and problem-solving abilities for open-ended technical challenges.
• Excellent communication skills for collaborating with cross-functional teams and presenting technical findings.

🏖️ Benefits

• 100% Remote Work: Work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
• Highly Competitive USD Pay: Earn compensation in USD that goes beyond typical market offerings.
• Paid Time Off: We value your well-being. Our paid time off policies give you the chance to unwind and recharge when needed.
• Work with Autonomy: Manage your own time as long as the work gets done. Focus on results, not the clock.
• Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.
