Site Reliability Engineer – Inference Infrastructure

🕒 January 13

Cohere

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Cohere is a leading enterprise AI platform optimized for generative AI, search and discovery, and advanced retrieval. The company offers AI-powered applications designed to augment and elevate the global workforce, helping businesses thrive in the AI era. Cohere provides solutions such as embedding and reranking models, allowing enterprises to efficiently retrieve information and build powerful applications. The company offers flexible deployment options for enterprise-grade AI, on any cloud or on-premises, and provides extensive developer resources and support. Cohere is committed to scaling intelligence to serve humanity, making intelligence abundant, affordable, and accessible.

📋 Description

• Build self-service systems that automate managing, deploying, and operating services, including our custom Kubernetes operators that support language model deployments.
• Automate environment observability and resilience, and enable all developers to troubleshoot and resolve problems.
• Take the steps required to ensure we hit defined SLOs, including participation in an on-call rotation.
• Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback.
• Develop our team through knowledge sharing and an active review process.

🎯 Requirements

• 5+ years of engineering experience running production infrastructure at a large scale
• Experience designing large, highly available distributed systems with Kubernetes, and GPU workloads on those clusters
• Experience with Kubernetes development and production coding and support
• Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving
• Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments
• Experience in compute, storage, and network resource and cost management
• Excellent collaboration and troubleshooting skills to build mission-critical systems and ensure smooth operations and efficient teamwork
• The grit and adaptability to solve complex technical challenges that evolve day to day
• Familiarity with the computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence the latency and throughput of inference
• Strong understanding of, or working experience with, distributed systems
• Experience in Golang, C++, or other languages designed for high-performance, scalable servers

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)
