
Artificial Intelligence • B2B • SaaS
Grupo Protege is an AI training data platform that connects AI developers with high-quality, ethically sourced training data. It serves AI developers by providing a large and varied collection of data for model training, and it serves data holders by enabling them to monetize their data while maintaining governance and control. The platform aims to streamline data procurement so that developers can access the data they need more easily.
July 18

• Data is the foundation of AI performance, and we believe model quality starts with data quality.
• You’ll be at the heart of shaping how we curate, assess, and prepare the training data that powers real-world AI systems.
• We’re seeking a Senior Member of the Core Data Team / Principal Scientist to lead the evaluation and optimization of large-scale datasets used to train state-of-the-art AI models.
• In this role, you’ll help define what "high-quality data" means in practice, using statistical, computational, and ML-driven methods to ensure our data is diverse, representative, and high-impact.
• You’ll work closely with research and engineering teams to improve model performance through better data.
• This is an ideal role for someone with a PhD in machine learning, CS, or a related applied field who is passionate about the role of data in AI training and excited to advance Protege’s mission to become the ubiquitous platform for AI training data.
• PhD or equivalent Master's Degree + 4+ years industry experience in machine learning, economics, mathematics, engineering, computer science, statistics, or a related quantitative field
• Strong understanding of AI model training pipelines, including pre-processing and evaluation
• Experience working with large, unstructured datasets, especially text
• Background in statistical analysis, bias detection, and data validation
• Able to identify high-impact problems and drive independent solutions

Bonus if you have these attributes:
• Experience with synthetic data generation or augmentation strategies
• Publications or open-source contributions in data-centric AI or related areas
• Experience developing evaluation frameworks or performance metrics for training data
• Cross-functional collaboration with product, infrastructure, or partnership teams