Senior Site Reliability Engineer, Core AI Infrastructure

🔥 3 minutes ago

Coinbase

1001 - 5000 employees

Founded 2012

₿ Crypto

💸 Finance

💳 Fintech

💰 $21.4M Post-IPO Equity on 2022-11

Coinbase is a leading cryptocurrency exchange platform that allows individuals and institutions to buy, sell, and trade crypto assets such as Bitcoin and Ethereum. The company offers advanced trading tools, institutional solutions, and a self-hosted wallet for storing and managing cryptocurrencies. With a strong focus on security and transparency, Coinbase provides a trusted platform used by millions globally. It supports features including staking, earning rewards, and spending crypto through its cards. Coinbase also provides developer tools and APIs for building onchain applications, making it a comprehensive hub for engaging in the crypto economy.

📋 Description

• Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros.
• Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments.
• Partner with the Coinbase Infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with Security and Compliance to integrate surveillance tooling into deployment pipelines.
• Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence.
• Develop full-stack applications that power internal AI products and infrastructure with Go or Python.

🎯 Requirements

• 5+ years of experience automating and supporting cloud infrastructure (AWS) and network environments
• Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments
• Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines
• Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements
• Responsible use of generative AI, maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality

🏖️ Benefits

• medical
• dental
• vision
• 401(k)

Similar Jobs

🔥 1 hour ago

Aya Healthcare

5001 - 10000

⚕️ Healthcare Insurance

🎯 Recruiter

Lead the SRE team at Aya Healthcare to enhance product reliability and operational efficiency. Manage incident response and AI-native operations for a top healthcare workforce solutions provider.

AWS

Azure

Google Cloud Platform

🔥 2 hours ago

Offchain Labs

11 - 50

₿ Crypto

🌐 Web 3

Site Reliability Engineer at Offchain leading a movement in blockchain scalability and security. Tackling real-world challenges and transforming interactions with decentralized applications.

AWS

Azure

Cloud

Google Cloud Platform

Linux

Python

Shell Scripting

Go

🔥 3 hours ago

BeyondTrust

1001 - 5000

🔒 Cybersecurity

Cloud Operations Engineer monitoring, maintaining, and responding to incidents for BeyondTrust Cloud Service. Collaborating across teams to ensure service health and handling cloud environments.

AWS

Azure

Cloud

Distributed Systems

Docker

JavaScript

Kubernetes

Linux

Python

Terraform

🔥 5 hours ago

MKS2 Technologies

201 - 500

🤝 B2B

🔒 Cybersecurity

Site Reliability Systems Engineer working with monitoring tools to enhance VA's infrastructure reliability. Collaborating across teams to resolve outages and improve service quality for veterans.

AWS

Azure

Cloud

Java

JavaScript

Linux

Oracle

ServiceNow

Splunk

Unix

🔥 5 hours ago

VAST Data

501 - 1000

DevOps Engineer developing tools to enhance efficiency for the Sales Engineering team at an AI infrastructure company. Responsible for managing AWS services and backend applications.

Angular

AWS

DNS

Docker

EC2

GraphQL

JavaScript

Linux

MongoDB

Node.js

SCSS

Shell Scripting

Unix