Principal Engineer, AI Inference Reliability

October 30

Cerebras Systems

Artificial Intelligence • Hardware • Healthcare Insurance

Cerebras Systems develops advanced AI hardware, notably the Cerebras Wafer Scale Engine, which is built for high-performance AI inference and outperforms traditional GPU setups. Its technology enables organizations such as Mayo Clinic and AlphaSense to run state-of-the-art AI models quickly and efficiently. With flexible deployment options, including cloud and on-premises solutions, Cerebras serves teams across a range of industries.

📋 Description

• Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
• Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers.
• Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents.
• Architect for reliability and observability: influence system design for redundancy, durability, and debuggability.
• Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection.
• Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service.
• Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights.
• Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems.

🎯 Requirements

• Bachelor's or master's degree in computer science or a related field.
• 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.
• Strong programming skills in at least one widely used backend language such as Python, C++, Go, or Rust.
• Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture.
• Excellent communication and cross-functional leadership skills.
• Bonus: prior experience building large-scale AI infrastructure systems.

🏖️ Benefits

• Health insurance
• 401(k) matching
• Flexible work hours
• Paid time off
• Professional development opportunities
