Senior ML Systems Engineer, Frameworks & Tooling

Artificial Intelligence • Enterprise • SaaS

Cohere is a leading enterprise AI platform optimized for generative AI, search and discovery, and advanced retrieval. The company offers AI-powered applications designed to augment and elevate the global workforce, helping businesses thrive in the AI era. Cohere provides solutions such as embedding and reranking models, allowing enterprises to efficiently retrieve information and build powerful applications. The company offers flexible deployment options for enterprise-grade AI, on any cloud or on-premises, and provides extensive developer resources and support. Cohere is committed to scaling intelligence to serve humanity, making intelligence abundant, affordable, and accessible.

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

3 days ago

🇬🇧 United Kingdom – Remote

⏰ Full Time

🟠 Senior

⚙️ Systems Engineer

🇬🇧 UK Skilled Worker Visa Sponsor

Docker

Kubernetes

Node.js

Ray

📋 Description

• Build and own the training framework responsible for large-scale LLM training.
• Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
• Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100).
• Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
• Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
• Investigate and resolve performance bottlenecks across the ML systems stack.
• Build robust systems that ensure reproducible, debuggable, large-scale runs.

🎯 Requirements

• Strong engineering experience in large-scale distributed training or HPC systems.
• Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
• Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
• Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
• Experience working with containerized environments (Docker, Singularity/Apptainer).
• A track record of building tools that increase developer velocity for ML teams.
• Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
• Strong collaboration skills: you’ll work closely with infra, research, and deployment teams.

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)
