AI/ML Evaluation Engineer – Global Solutions Provider

November 20


Truelogic Software

SaaS • B2B • Enterprise

Truelogic Software is a nearshore software development company specializing in agile staff augmentation. It provides custom outsourced software development through a team of engineers from Latin America. The company partners with both startups and Fortune 500 companies, working in its clients' time zones and emphasizing collaboration and responsiveness. With a presence in over 25 countries, Truelogic promotes remote work for a better quality of life, and its engineers have delivered projects across a range of industries.

501 - 1000 employees

Founded 2004


📋 Description

• Write Python and SQL scripts to evaluate outputs from large language models (LLMs).
• Design and implement LLM-as-Judge evaluations with clear scoring rubrics (faithfulness, relevance, completeness, correctness).
• Define and calculate metrics such as exact match, token-level F1, ROUGE, cosine similarity, and subjective rubric scores.
• Build and maintain ground-truth datasets for benchmarking and regression testing.
• Automate evaluation workflows and integrate them into CI/CD pipelines.
• Analyze large unstructured datasets to identify inconsistencies, anomalies, biases, and missing values.
• Diagnose failure modes such as hallucinations, irrelevant answers, and formatting issues.
• Produce clear reports summarizing evaluation findings and quality trends.
• Collaborate with AI engineers, QA, data scientists, and product managers to define quality standards and release criteria.
• Document all processes, evaluation setups, specifications, and architecture diagrams.
• Maintain reproducibility and traceability for all evaluation runs and datasets.
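To make two of the metrics named above concrete, here is a minimal sketch of exact match and token-level F1 in Python. The normalization rules (lowercasing, stripping ASCII punctuation, whitespace tokenization) and the function names are assumptions chosen for illustration; they are not taken from the posting, and a real evaluation suite may normalize differently.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the bag of tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match is all-or-nothing, while token-level F1 gives partial credit when a model output overlaps the ground-truth answer without reproducing it exactly, which is why the two are often reported together.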

🎯 Requirements

• Advanced Python skills, including writing, debugging, and automating scripts.
• Strong SQL proficiency and experience manipulating large datasets.
• Hands-on experience with Python libraries such as Pandas and NumPy.
• Ability to clean, standardize, and analyze structured and unstructured data.
• Experience inspecting datasets, visualizing distributions, and preparing data for analysis.
• Solid understanding of large language models, prompt behavior, hallucinations, and grounding concepts.
• Knowledge of retrieval-augmented generation (RAG) flows and embedding-based search.
• Awareness of vector similarity concepts such as cosine similarity and dot product.
• Experience with at least one LLM evaluation framework (RAGAS, TruLens, LangSmith, etc.) or ability to quickly learn one.
• Ability to design or implement custom LLM-as-Judge evaluation systems.
• Applied understanding of statistical concepts such as variance, confidence intervals, precision/recall, and correlation.
• Ability to translate ambiguous quality expectations into measurable metrics.
• Familiarity with cloud-run services and automation pipelines, preferably on Google Cloud Platform (GCP).
• Ability to learn new infrastructure tools quickly.
• Strong analytical and problem-solving abilities for open-ended technical challenges.
• Excellent communication skills for collaborating with cross-functional teams and presenting technical findings.
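As a small illustration of the vector similarity concepts listed above, the sketch below contrasts dot product with cosine similarity using only the Python standard library. The function names and the choice to return 0.0 for a zero-length vector are assumptions made for this example; production code would more likely use NumPy on real embedding vectors.

```python
import math
from collections.abc import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; grows with vector magnitude."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by both norms; depends only on direction."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return dot(a, b) / (norm_a * norm_b)
```

Scaling a vector changes its dot product with another vector but not its cosine similarity, so the two measures rank embedding-search results identically only when the embeddings are normalized to unit length.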

🏖️ Benefits

• 100% Remote Work: Work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
• Highly Competitive USD Pay: Earn compensation in USD that goes beyond typical market offerings.
• Paid Time Off: We value your well-being. Our paid time off policies give you the chance to unwind and recharge when needed.
• Work with Autonomy: Manage your own time as long as the work gets done. Focus on results, not the clock.
• Work with Top American Companies: Grow your expertise on innovative, high-impact projects with industry-leading U.S. companies.


