Site Reliability Engineer – Inference Infrastructure

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Cohere is a leading enterprise AI platform optimized for generative AI, search and discovery, and advanced retrieval. The company offers AI-powered applications designed to augment and elevate the global workforce, helping businesses thrive in the AI era. Cohere provides solutions such as embedding and reranking models, allowing enterprises to efficiently retrieve information and build powerful applications. The company offers flexible deployment options for enterprise-grade AI, on any cloud or on-premises, and provides extensive developer resources and support. Cohere is committed to scaling intelligence to serve humanity, making intelligence abundant, affordable, and accessible.

🕒 January 13

🇨🇦 Canada – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Cloud

Distributed Systems

Google Cloud Platform

Kubernetes

Linux

Cohere

📋 Description

• Build self-service systems that automate managing, deploying, and operating services. This includes our custom Kubernetes operators that support language model deployments.
• Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems.
• Take steps required to ensure we hit defined SLOs, including participation in an on-call rotation.
• Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback.
• Develop our team through knowledge sharing and an active review process.

🎯 Requirements

• 5+ years of engineering experience running production infrastructure at a large scale
• Experience designing large, highly available distributed systems with Kubernetes, and running GPU workloads on those clusters
• Experience with Kubernetes development, production coding, and support
• Experience with GCP, Azure, AWS, and OCI, and with multi-cloud, on-prem, or hybrid serving
• Experience designing, deploying, supporting, and troubleshooting complex Linux-based computing environments
• Experience in compute, storage, and network resource and cost management
• Excellent collaboration and troubleshooting skills to build mission-critical systems and ensure smooth operations and efficient teamwork
• The grit and adaptability to solve complex technical challenges that evolve day to day
• Familiarity with the computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence the latency and throughput of inference
• Strong understanding of, or working experience with, distributed systems
• Experience in Golang, C++, or other languages designed for high-performance, scalable servers

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)
