Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

51 - 200 employees

Founded 2015

🤖 Artificial Intelligence

☁️ SaaS

🔌 API

💰 $47M Series B on 2022-11

Deepgram is a leading voice AI company that provides powerful APIs for speech-to-text, text-to-speech, and language understanding applications. Their platform enables developers to build sophisticated voice AI solutions for use cases such as contact centers, medical transcription, conversational AI, and more. Known for unmatched accuracy, speed, and cost-effectiveness, Deepgram's technology is trusted by top enterprises and startups worldwide. By offering real-time and highly accurate transcription capabilities, Deepgram helps businesses gain insights from voice data, making it an essential tool for transforming voice interactions.

🕒 March 10

🇺🇸 United States – Remote

💵 $150k - $220k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Kubernetes

Python

Terraform

Deepgram

📋 Description

• Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the life cycle of single-tenant, managed deployments.

🎯 Requirements

• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
• Proven, hands-on experience building and managing production infrastructure with Terraform
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
• Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
• Strong scripting and automation skills (e.g., Python, Go, Bash)

🏖️ Benefits

• Medical, dental, vision benefits
• Annual wellness stipend
• Mental health support
• Life, STD, LTD Income Insurance Plans
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 Paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax Savings Programs
• Learning / Education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops / sessions

Similar Jobs

Expert DevOps, DevSecOps, GenAI

🕒 March 7

Inetum

10,000+ employees

🤝 B2B

🏢 Enterprise

☁️ SaaS

Expert DevOps / DevSecOps supporting Generative AI initiatives at Inetum for digital transformation in the United States. Designing high-value GenAI use cases and integrating new tools and practices.

🇺🇸 United States – Remote

💰 Post-IPO Equity on 2007-03

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🗣️🇫🇷 French Required

Cloud

Open Source

Site Reliability Engineering Manager

🕒 March 7

Flywire

1001 - 5000

💸 Finance

💳 Fintech

Manager II of Site Reliability Engineering at Flywire driving reliability, automation, and performance in cloud infrastructure. Collaborating with Engineering teams to achieve production excellence in a global environment.

🇺🇸 United States – Remote

💵 $160k - $200k / year

💰 $60M Series F on 2021-03

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Cloud

Senior DevOps Engineer

🕒 March 5

Akamai Technologies

5001 - 10000

🔒 Cybersecurity

Senior II DevOps Engineer developing and maintaining cloud infrastructures and applications for FedRAMP compliance. Collaborating with teams on network security projects and enhancing product deployment.

🇺🇸 United States – Remote

💵 $112.5k - $202.5k / year

💰 Post-IPO Equity on 2001-07

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Azure

Cloud

Docker

Google Cloud Platform

Jenkins

Kubernetes

Linux

Microservices

Python

Terraform

VMware

Design and Release Engineer, Glazing

🕒 March 4

ALTEN Technology USA

501 - 1000

🚀 Aerospace

⚡ Energy

Design and Release Engineer developing vehicle components and systems from concept to production at ALTEN Technology USA.

🇺🇸 United States – Remote

💵 $115k - $135k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

DevSecOps Engineer III

🕒 March 3

Kapitus

201 - 500

💸 Finance

💳 Fintech

🤝 B2B

Cloud DevSecOps Engineer III enhancing security for Kapitus through AWS solutions. Responsibilities include monitoring, programming, testing, and collaboration with developers.

🇺🇸 United States – Remote

💵 $117.8k - $189k / year

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Cloud

Distributed Systems

DynamoDB