Site Reliability Engineer – Inference Infrastructure

🕒 January 13

🇨🇦 Canada – Remote

⏰ Full-time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

Cohere

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Cohere is a leading AI platform that provides enterprises with advanced language models and an integrated workspace designed for efficiency and security. With a family of high-performance generative and retrieval models, Cohere enables organizations to streamline workflows, improve data security, and uncover insights across industries through multilingual capabilities. Its focus on customized AI solutions ensures that critical data is protected while enabling seamless integration into existing organizational processes.

Description

• Build self-service systems that automate managing, deploying and operating services. This includes our custom Kubernetes operators that support language model deployments.
• Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems.
• Take steps required to ensure we hit defined SLOs, including participation in an on-call rotation.
• Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback.
• Develop our team through knowledge sharing and an active review process.

🎯 Requirements

• 5+ years of engineering experience running production infrastructure at large scale
• Experience designing large, highly available distributed systems with Kubernetes, and GPU workloads on those clusters
• Experience with Kubernetes dev and production coding and support
• Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving
• Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments
• Experience in compute/storage/network resource and cost management
• Excellent collaboration and troubleshooting skills to build mission-critical systems, and ensure smooth operations and efficient teamwork
• The grit and adaptability to solve complex technical challenges that evolve day to day
• Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference
• Strong understanding of, or working experience with, distributed systems
• Experience in Golang, C++ or other languages designed for high-performance scalable servers

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)
