AI/ML Evaluation Engineer

November 20

Truelogic Software

SaaS • B2B • Enterprise

Truelogic Software is a nearshore software development company specializing in agile staff augmentation services. They focus on providing custom outsourced software development with a team of highly skilled engineers from Latin America. Truelogic Software partners with both startups and Fortune 500 companies, offering solutions that align with their clients' time zones and ensuring high-quality outcomes through collaboration and responsiveness. With a presence in over 25 countries, Truelogic emphasizes remote work for better quality of life, and their engineers are experienced in various industries, delivering a wide range of successful projects globally.

501 - 1000 employees

Founded 2004

📋 Description

• Write Python and SQL scripts to evaluate outputs from large language models (LLMs).
• Design and implement LLM-as-Judge evaluations with clear scoring rubrics (faithfulness, relevance, completeness, correctness).
• Define and calculate metrics such as exact match, token-level F1, ROUGE, cosine similarity, and subjective rubric scores.
• Build and maintain ground-truth datasets for benchmarking and regression testing.
• Automate evaluation workflows and integrate them into CI/CD pipelines.
• Analyze large unstructured datasets to identify inconsistencies, anomalies, biases, and missing values.
• Diagnose failure modes such as hallucinations, irrelevant answers, and formatting issues.
• Produce clear reports summarizing evaluation findings and quality trends.
• Collaborate with AI engineers, QA, data scientists, and product managers to define quality standards and release criteria.
• Document all processes, evaluation setups, specifications, and architecture diagrams.
• Maintain reproducibility and traceability for all evaluation runs and datasets.
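
For candidates unfamiliar with the metrics named above, here is a minimal sketch of exact match and token-level F1 in Python. It is illustrative only and not Truelogic's actual tooling: the function names and the simplified normalization (lowercasing and whitespace splitting) are assumptions, and production evaluations often also strip punctuation and articles.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumption: simplified normalization (lowercase, split on whitespace).
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    # True when the normalized token sequences are identical.
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    # Harmonic mean of token-level precision and recall,
    # counting each shared token at most as often as it appears in both.
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat")` has precision 2/3 and recall 1, giving an F1 of 0.8, while `exact_match` on the same pair is false.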

🎯 Requirements

• Advanced Python skills, including writing, debugging, and automating scripts.
• Strong SQL proficiency and experience manipulating large datasets.
• Hands-on experience with Python libraries such as Pandas and NumPy.
• Ability to clean, standardize, and analyze structured and unstructured data.
• Experience inspecting datasets, visualizing distributions, and preparing data for analysis.
• Solid understanding of large language models, prompt behavior, hallucinations, and grounding concepts.
• Knowledge of retrieval-augmented generation (RAG) flows and embedding-based search.
• Awareness of vector similarity concepts such as cosine similarity and dot product.
• Experience with at least one LLM evaluation framework (RAGAS, TruLens, LangSmith, etc.) or ability to quickly learn one.
• Ability to design or implement custom LLM-as-Judge evaluation systems.
• Applied understanding of statistical concepts such as variance, confidence intervals, precision/recall, and correlation.
• Ability to translate ambiguous quality expectations into measurable metrics.
• Familiarity with cloud-run services and automation pipelines, preferably on Google Cloud Platform (GCP).
• Ability to learn new infrastructure tools quickly.
• Strong analytical and problem-solving abilities for open-ended technical challenges.
• Excellent communication skills for collaborating with cross-functional teams and presenting technical findings.
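
As a quick illustration of the vector similarity concepts mentioned above, the sketch below contrasts the dot product with cosine similarity using only the Python standard library. The function names are hypothetical, and real embedding pipelines would typically use NumPy or a vector database instead.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    # Sum of element-wise products; grows with vector magnitude.
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product divided by the product of the vector norms,
    # so only direction matters, not magnitude.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat similarity with a zero vector as 0.
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)
```

Two vectors that point in the same direction but differ in length, such as `[3, 4]` and `[6, 8]`, have a dot product of 50 but a cosine similarity of 1, which is why cosine similarity is a common choice for comparing embeddings that are not length-normalized.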

🏖️ Benefits

• 100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
• Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical market offerings.
• Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
• Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
• Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.
