Staff Site Reliability Engineer, Core AI Infrastructure

🔥 3 minutes ago

Coinbase

1001 - 5000 employees

Founded 2012

₿ Crypto

💸 Finance

💳 Fintech

💰 $21.4M Post-IPO Equity on 2022-11

Coinbase is a leading cryptocurrency exchange platform that allows individuals and institutions to buy, sell, and trade crypto assets such as Bitcoin and Ethereum. The company offers advanced trading tools, institutional solutions, and a self-hosted wallet for storing and managing cryptocurrencies. With a strong focus on security and transparency, Coinbase provides a trusted platform used by millions globally. It supports features including staking, earning rewards, and spending crypto through its cards. Coinbase also provides developer tools and APIs for building onchain applications, making it a comprehensive hub for engaging in the crypto economy.

📋 Description

• Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros.
• Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments.
• Partner with the Coinbase Infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with Security and Compliance to integrate surveillance tooling into deployment pipelines.
• Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence.
• Develop full-stack applications that power internal AI products and infrastructure with Go or Python.

🎯 Requirements

• 8+ years of experience automating and supporting cloud infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt).
• Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments.
• Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and in version control workflows using Git-based CI/CD pipelines.
• Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements.
• Responsible use of generative AI, maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality.

🏖️ Benefits

• medical • dental • vision • 401(k)
