Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

🕒 March 10

🇺🇸 United States – Remote

💵 $150k - $220k / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Deepgram

51 - 200 employees

Founded 2015

🤖 Artificial Intelligence

☁️ SaaS

🔌 API

💰 $47M Series B on 2022-11

Deepgram is a leading voice AI company that provides powerful APIs for speech-to-text, text-to-speech, and language understanding applications. Its platform enables developers to build sophisticated voice AI solutions for use cases such as contact centers, medical transcription, conversational AI, and more. Known for unmatched accuracy, speed, and cost-effectiveness, Deepgram's technology is trusted by top enterprises and startups worldwide. By offering real-time and highly accurate transcription capabilities, Deepgram helps businesses gain insights from voice data, making it an essential tool for transforming voice interactions.

📋 Description

• Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the life cycle of single-tenant, managed deployments.

🎯 Requirements

• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
• Proven, hands-on experience building and managing production infrastructure with Terraform
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
• Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
• Strong scripting and automation skills (e.g., Python, Go, Bash)

🏖️ Benefits

• Medical, dental, vision benefits
• Annual wellness stipend
• Mental health support
• Life, STD, LTD Income Insurance Plans
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 Paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax Savings Programs
• Learning / Education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops / sessions

Similar Jobs

🕒 March 9

Elligint Health

51 - 200

⚕️ Healthcare Insurance

🧬 Biotechnology

DevOps Engineer optimizing Windows-based web services in AWS for healthcare organization. Collaborating on file processing and ensuring compliance with healthcare regulations.

🕒 March 7

Inetum

10,000+ employees

🤝 B2B

🏢 Enterprise

☁️ SaaS

Expert DevOps / DevSecOps supporting Generative AI initiatives at Inetum for digital transformation in the United States. Designing high-value GenAI use cases and integrating new tools and practices.

🗣️🇫🇷 French Required

🕒 March 7

Flywire

1001 - 5000

💸 Finance

💳 Fintech

Manager II of Site Reliability Engineering at Flywire driving reliability, automation, and performance in cloud infrastructure. Collaborating with Engineering teams to achieve production excellence in a global environment.

🕒 March 5

NOVA Corporation

1 - 10

🤝 B2B

☁️ SaaS

DevSecOps & Cloud Operations Engineer at North Stone supporting cloud automation, monitoring, and security. Managing CI/CD pipelines and optimizing system performance across cloud platforms.

🕒 March 5

Akamai Technologies

5001 - 10000

🔒 Cybersecurity

Senior II DevOps Engineer developing and maintaining cloud infrastructures and applications for FedRAMP compliance. Collaborating with teams on network security projects and enhancing product deployment.