Staff Database Reliability Engineer

51 - 200 employees

Founded 2019

☁️ SaaS

⚡ Productivity

🏢 Enterprise

Scribe is a workflow automation platform that enhances team productivity by automatically creating and sharing step-by-step guides for internal processes. Designed for operations, customer service, and HR teams, it simplifies documentation, training, and onboarding by leveraging AI to generate SOPs, training materials, and process overviews. Scribe enables organizations to centralize their knowledge, reduce training times, and improve compliance with its easy-to-use features and integrations across various platforms.

🕒 May 7

🏄 California – Remote

💵 $225k - $250k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

Amazon Redshift

AWS

BigQuery

Django

Kafka

Postgres

Python

RabbitMQ

Redis

SQL

Terraform

📋 Description

• Own the data tier end-to-end
• Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases
• Review migrations for safety at scale: locks, backfills, concurrent index builds, NOT VALID constraints
• Catch N+1 patterns and missing select_related/prefetch_related in review
• Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)
• Scale review through automation, not heroics: author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge
• Capacity planning as traffic and engineering throughput grow
• Zero-downtime schema migrations and cutovers
• Multi-AZ resilience within a single region: Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs
• Backups, PITR, failover testing, retention
• Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake)
• DMS task design and tuning, replication slot hygiene on the Postgres side
• Schema evolution as Django migrations roll through, so a column rename doesn't silently break the warehouse at 6 AM
• Parquet layout and partitioning, reliability of the Snowflake handoff
• Automated checks that flag migrations likely to break downstream consumers
• Drive observability across three complementary tools: pganalyze, CloudWatch, Honeycomb

🎯 Requirements

• Deep PostgreSQL: EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum. Aurora Serverless v2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)
• Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar): predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries
• Single-region multi-AZ design: practical understanding of what it does and doesn't protect against
• Production CDC experience, ideally AWS DMS: comfortable with logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake (or BigQuery/Redshift)
• Hands-on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, Logs Insights), and Honeycomb (or another high-cardinality tracing tool); comfortable with OpenTelemetry and opinionated about what makes a trace useful
• Real experience making AI coding and review tools useful for a team: writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs
• OpenSearch at scale: sizing, sharding, JVM tuning, rolling upgrades, snapshots
• Production Redis: persistence tradeoffs, cluster mode, hot keys, thundering herds
• At least one production message broker (SQS, RabbitMQ, Kafka): delivery semantics, idempotency, failure modes
• Strong automation and IaC background: real code (Python, Go, or similar) and Terraform
• Track record leading cross-team initiatives, writing design docs that hold up, influencing without authority
• Comfortable in a high-growth environment where the right answer for 50 engineers isn't the right answer for 100
• Pragmatic outlook during incidents, focused on preventing the next one

🏖️ Benefits

• Some of the nicest and smartest teammates you’ll ever work with
• Competitive salaries
• Comprehensive healthcare benefits
• Exciting and motivating equity
• Flexible PTO
• 401k
• Parental Leave
• Commuter Benefits (SF office employees)
• WFH Stipend
