Reddit, Inc.

501 - 1000 employees

Founded 2005

👥 B2C

📱 Media

🌍 Social Impact

Reddit, Inc. is a social media platform that acts as a hub for thousands of communities, where users can engage in diverse conversations ranging from breaking news to niche interests. It enables users to post, comment, and vote on content, fostering a vibrant online community. Millions of people globally connect and share their passions on Reddit, creating a dynamic environment for authentic human interaction.

Staff Research Engineer – Post-training & Evaluation

🕒 3 days ago

🇺🇸 United States – Remote

💵 $230k - $322k / year

⏰ Full Time

🔴 Lead

📚 Research Engineer

Python

PyTorch

📋 Description

• Define the 'Reddit Benchmark' evaluation standard: Own the methodology (not just the harness) for rigorously measuring model quality across Safety, Reasoning, representation/retrieval, and Reddit-specific knowledge. Decide what 'Reddit-native' means in measurable terms and set the bar the org trains against.
• Own evaluation reliability and statistical rigor: Establish the science behind trustworthy evals, including judge variance, multi-sample scoring, inter-rater/inter-sample agreement, sampling and temperature effects, and calibration of automated judges. You are accountable for whether a benchmark delta is real or noise. Drive the practice of evaluation as a release gate (offline against frozen datasets, and pre-merge in CI/CD) so regressions are caught before endpoints ship.
• Design model-as-a-judge methodology: Own judge selection, prompt design, calibration, and reliability for automated evaluation using frontier external models, enabling rapid, trustworthy iteration cycles.
• Set post-training recipes and strategy: Design SFT recipes (data mixtures, curriculum, ablation strategy) that convert base models into helpful, well-aligned endpoints; partner with engineering to scale them.
• Evaluate base and CPT checkpoints, not just endpoints: Design checkpoint-selection methodology across CPT experiments and LR studies, so we pick the right base before committing post-training compute.
• Drive synthetic data generation strategy: Define and curate high-quality instruction and evaluation sets to improve generalization where human data is scarce.
• Partner with Safety Engineering: Translate high-level safety policy into concrete classification metrics, probe sets, and CI/CD unit tests, including precision/recall at threshold, label-noise handling, and false-positive taxonomy for abuse detection (HHV).
• Diagnose post-training instability: Dive into loss curves and eval logs to identify alignment tax and capability degradation, and recommend the fix.
• Lead research direction: Set technical direction for evaluation and post-training across the team, mentor engineers and scientists, and represent the work internally (and externally where appropriate).

🎯 Requirements

• 6+ years of professional ML experience (or PhD + 4+) with a direct focus on LLM post-training and evaluation.
• PhD or MS in CS, ML, NLP, IR, or a related quantitative field, or equivalent industry research experience.
• Deep expertise in evaluation reliability: judge/sample variance, multi-sample scoring, calibration, statistical significance, and the failure modes of automated evaluation.
• Strong experience building custom, domain-specific evaluation harnesses (e.g., lm-eval-harness, Inspect AI, LightEval). You know the strengths and limits of benchmarks like MMLU and GSM8K and when they don't apply, and you treat eval sets as versioned, frozen, regression-tracked code.
• Experience evaluating both generation and representation/classification: model-as-a-judge for generative quality and precision/recall, PR-AUC, retrieval/MTEB-style metrics, gold-label denoising, and label-noise handling.
• Deep understanding of Continuous Pre-training (CPT), Instruction Tuning (SFT), and how data quality shapes model behavior.
• Fluency in Python; strong data-pipeline and eval-harness engineering (e.g., Hugging Face Transformers, vLLM, lm-eval-harness). Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed ZeRO-3) sufficient to direct and debug post-training runs.

🏖️ Benefits

• Comprehensive Healthcare Benefits and Income Replacement Programs
• 401k with Employer Match
• Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
• Family Planning Support
• Gender-Affirming Care
• Mental Health & Coaching Benefits
• Flexible Vacation & Paid Volunteer Time Off
• Generous Paid Parental Leave
