Principal Engineer, AI Inference Reliability

Artificial Intelligence • Hardware • Healthcare Insurance

Cerebras Systems develops AI hardware, most notably the Cerebras Wafer Scale Engine, which delivers fast AI inference and outperforms traditional GPU setups. Organizations such as Mayo Clinic and AlphaSense use its technology to run state-of-the-art AI models quickly and efficiently. Cerebras offers flexible deployment options, including cloud and on-premises solutions, to teams across a range of industries.

201 - 500 employees

Founded 2016

🤖 Artificial Intelligence

🔧 Hardware

⚕️ Healthcare Insurance

October 30

🇨🇦 Canada – Remote

⏰ Full Time

🔴 Lead

🧑‍💻 Full-stack Engineer

Distributed Systems

Python

Rust

📋 Description

• Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
• Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers.
• Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents.
• Architect for reliability and observability: influence system design for redundancy, durability, and debuggability.
• Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection.
• Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service.
• Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights.
• Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems.

🎯 Requirements

• Bachelor's or master's degree in computer science or a related field.
• 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.
• Strong programming skills in at least one popular backend language such as Python, C++, Go, or Rust.
• Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture.
• Excellent communication and cross-functional leadership skills.
• Bonus: prior experience building large-scale AI infrastructure systems.

🏖️ Benefits

• Health insurance
• 401(k) matching
• Flexible work hours
• Paid time off
• Professional development opportunities
