Model Evaluation QA Lead

🕒 February 9

Deepgram

51 - 200 employees

Founded 2015

🤖 Artificial Intelligence

☁️ SaaS

🔌 API

💰 $47M Series B (November 2022)

Deepgram is a leading voice AI company that provides powerful APIs for speech-to-text, text-to-speech, and language understanding applications. Their platform enables developers to build sophisticated voice AI solutions for use cases such as contact centers, medical transcription, conversational AI, and more. Known for unmatched accuracy, speed, and cost-effectiveness, Deepgram's technology is trusted by top enterprises and startups worldwide. By offering real-time and highly accurate transcription capabilities, Deepgram helps businesses gain insights from voice data, making it an essential tool for transforming voice interactions.

📋 Description

• Model Evaluation Automation: Design, build, and maintain automated model evaluation pipelines that run against every candidate model before release. Implement objective and subjective quality metrics (WER, SER, MOS, latency/throughput) across STT, TTS, and STS product lines.
• Release Gate Integration: Embed model quality checkpoints into CI/CD and release pipelines. Define pass/fail criteria, build dashboards for model comparison, and own the go/no-go signal for model promotions to production.
• Agent & Model Evaluation Frameworks: Stand up and operate evaluation tooling (Coval, Braintrust, Blue Jay, custom harnesses) for end-to-end voice agent testing, covering accuracy, latency, turn-taking, conversational quality, and custom metrics across real-world scenarios.
• Active Learning & Data Ingestion Testing: Partner with the Active Learning team to validate data ingestion infrastructure, annotation pipelines, and retraining automation. Ensure data quality standards are met at every stage of the flywheel.
• Industry Benchmark Automation: Automate execution and reporting of industry-standard benchmarks (e.g., LibriSpeech, CommonVoice, internal production-traffic evals). Maintain reproducible benchmark environments and publish results for internal consumption.
• Language & Domain Validation: Build and maintain test suites for multi-language and domain-specific model validation. Design coverage matrices that ensure new languages and acoustic domains are systematically evaluated before GA.
• Retraining Automation Support: Validate the end-to-end retraining pipeline across all data sources, from data selection and preprocessing through training, evaluation, and promotion, to ensure the automation is reliable and correct.
• Manual Test Feedback Loop: Design and operate human-in-the-loop evaluation workflows for subjective quality assessment. Build the tooling and processes that translate human feedback into actionable quality signals for the ML team.

🎯 Requirements

• 4–7 years of experience in QA engineering, ML evaluation, or a related technical role, with a focus on the quality of predictive and generative models and their data.
• Hands-on experience building automated test/evaluation pipelines for ML models and connecting software features.
• Strong programming skills in Python; experience with ML evaluation libraries, data processing frameworks (Pandas, NumPy), and scripting for pipeline automation.
• Familiarity with speech/audio ML concepts: WER, SER, MOS, acoustic models, language models, or similar evaluation metrics.
• Experience with CI/CD integration for ML workflows (e.g., GitHub Actions, Jenkins, Argo, MLflow, or equivalent).
• Ability to design and maintain reproducible benchmark environments across multiple model versions and configurations.
• Strong communication skills: you can translate model quality metrics into actionable insights for engineering, research, and product stakeholders.
• Detail-oriented and systematic, with a bias toward automation over manual process.

🏖️ Benefits

• Medical, dental, vision benefits
• Annual wellness stipend
• Mental health support
• Life, STD, LTD Income Insurance Plans
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax Savings Programs
• Learning / Education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops / sessions
