Senior ML Systems Engineer, Frameworks & Tooling

🕒 December 1, 2025

Cohere

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Cohere is a leading enterprise AI platform optimized for generative AI, search and discovery, and advanced retrieval. The company offers AI-powered applications designed to augment and elevate the global workforce, helping businesses thrive in the AI era. Cohere provides solutions such as embedding and reranking models, allowing enterprises to efficiently retrieve information and build powerful applications. The company offers flexible deployment options for enterprise-grade AI, on any cloud or on-premises, and provides extensive developer resources and support. Cohere is committed to scaling intelligence to serve humanity, making intelligence abundant, affordable, and accessible.

📋 Description

• Build and own the training framework responsible for large-scale LLM training.
• Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
• Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100).
• Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
• Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
• Investigate and resolve performance bottlenecks across the ML systems stack.
• Build robust systems that ensure reproducible, debuggable, large-scale runs.

🎯 Requirements

• Strong engineering experience in large-scale distributed training or HPC systems.
• Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
• Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
• Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
• Experience working with containerized environments (Docker, Singularity/Apptainer).
• A track record of building tools that increase developer velocity for ML teams.
• Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
• Strong collaboration skills: you’ll work closely with infra, research, and deployment teams.

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)
