Member of Engineering – Pre-training, Synthetic Data

🕒 4 months ago

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

🖥 Software Engineer

🗣️🇺🇸🇬🇧 English required

poolside

51 - 200 employees

Founded in 2023

🤖 Artificial Intelligence

🏢 Enterprise

Poolside is an accelerator built specifically for Web3 founders and builders. It supports projects in decentralized finance (DeFi), gaming, governance, infrastructure, and NFTs. With an ecosystem of 20,000 members, including mentors, investors, and Web3 builders, Poolside has co-launched and supported more than 110 projects. The accelerator offers privileged access to mentorship and technical expertise to help Web3 projects scale and launch successfully. Poolside also works with leading companies and protocols to drive growth and innovation in the Web3 ecosystem.

Description

• You'll work on our data team, which focuses on the quality of the datasets delivered for training our models.
• This is a hands-on role: your #1 mission is to improve the quality of the pretraining datasets by drawing on your previous experience, intuition, and training experiments.
• The role focuses in particular on generating synthetic data at scale and determining the best strategies for using such data to train large models.
• You'll collaborate closely with other teams such as Pretraining, Post-training, Evals, and Product to define high-quality data needs that map to missing model capabilities and downstream use cases.
• Staying in sync with the latest research in synthetic data generation and pretraining is key to success in this role.
• You will continually lead original research initiatives through short, time-bounded experiments while deploying highly technical engineering solutions into production.
• Because the volumes of data to process are massive, you'll have a performant distributed data pipeline and a large GPU cluster at your disposal.
• The overall goal is to deliver large, high-quality, and diverse synthetic datasets that mix natural language and code modalities to train best-in-class coding agents.

🎯 Requirements

• Strong machine learning and engineering background
• Experience with Large Language Models (LLMs)
• Understanding of how LLMs learn
• Data ablations and scaling laws
• Post-training techniques
• Training reasoning and agentic models
• Experience implementing cost-efficient, complex pipelines that generate synthetic datasets at scale while optimizing for data quality, correctness, diversity, etc.
• Experience with evals that track model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)
• Experience building trillion-scale pretraining datasets, and familiarity with concepts such as data curation, deduplication, data mixing, tokenization, curriculum, and the impact of data repetition
• Excellent programming skills in Python
• Strong prompt engineering skills
• Experience working with large-scale GPU clusters and distributed data pipelines
• Strong obsession with data quality
• Research experience (nice to have): authorship of scientific papers on topics such as applied deep learning, LLMs, or source code generation
• Can freely discuss the latest papers and go into fine detail
• Is reasonably opinionated

🏖️ Benefits

• Fully remote work & flexible hours
• 37 days/year of vacation & holidays
• Health insurance allowance for you and dependents
• Company-provided equipment
• Wellbeing, always-be-learning, and home office allowances
• Frequent team get-togethers
• Great, diverse & inclusive people-first culture
