Member of Engineering – Pre-training, Synthetic Data

🕒 January 29

🇺🇸 United States – Remote (USA)

⏰ Full-time

🟡 Mid-level

🟠 Senior

🖥 Software Engineer

🗣️🇺🇸🇬🇧 English required


poolside

51 - 200 employees

Founded in 2023

🤖 Artificial Intelligence

🏢 Enterprise

Poolside is an accelerator designed specifically for Web3 founders and builders. It supports projects in decentralized finance (DeFi), gaming, governance, infrastructure, and NFTs. With a robust ecosystem of 20,000 members, including mentors, investors, and Web3 builders, Poolside has co-launched and supported more than 110 projects. The accelerator provides distinctive access to mentorship and technical expertise to help Web3 projects scale and launch successfully. Poolside also engages with leading companies and protocols to drive growth and innovation in the Web3 space.

Description

• You’ll work on our data team, which focuses on the quality of the datasets delivered for training our models.
• This is a hands-on role where your #1 mission is to improve the quality of the pretraining datasets by leveraging your previous experience, intuition, and training experiments.
• This role focuses in particular on generating synthetic data at scale and determining the best strategies for using such data to train large models.
• You’ll collaborate closely with other teams, such as Pretraining, Post-training, Evals, and Product, to define high-quality data needs that map to missing model capabilities and downstream use cases.
• Staying in sync with the latest research in synthetic data generation and pretraining is key to success in this role.
• You will constantly lead original research initiatives through short, time-bounded experiments while deploying highly technical engineering solutions into production.
• Because the volumes of data to process are massive, you'll have a performant distributed data pipeline and a large GPU cluster at your disposal.
• Your goal is to deliver large, high-quality, and diverse synthetic datasets that mix natural language and code modalities to train best-in-class coding agents.

🎯 Requirements

• Strong machine learning and engineering background
• Experience with Large Language Models (LLMs)
• Understanding of how LLMs learn
• Data ablations and scaling laws
• Post-training techniques
• Training reasoning and agentic models
• Experience implementing cost-efficient, complex pipelines that generate synthetic datasets at scale while optimizing for data quality, correctness, diversity, etc.
• Experience with evals that track model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)
• Experience building trillion-scale pretraining datasets, and familiarity with concepts like data curation, deduplication, data mixing, tokenization, curriculum, impact of data repetition, etc.
• Excellent programming skills in Python
• Strong prompt engineering skills
• Experience working with large-scale GPU clusters and distributed data pipelines
• Strong obsession with data quality
• Research experience is a nice-to-have: authorship of scientific papers on topics such as applied deep learning, LLMs, source code generation, etc.
• Can freely discuss the latest papers and descend to fine details
• Is reasonably opinionated

🏖️ Benefits

• Fully remote work & flexible hours
• 37 days/year of vacation & holidays
• Health insurance allowance for you and dependents
• Company-provided equipment
• Wellbeing, always-be-learning and home office allowances
• Frequent team get-togethers
• Great diverse & inclusive people-first culture
