Staff Database Reliability Engineer


51 - 200 employees

Founded 2019

☁️ SaaS

⚡ Productivity

🏢 Enterprise

Scribe is a workflow automation platform that boosts team productivity by automatically creating and sharing step-by-step guides for internal processes. Designed for operations, customer service, and HR teams, it simplifies documentation, training, and onboarding by using AI to generate standard operating procedures, training materials, and process overviews. With its easy-to-use features and integrations across various platforms, Scribe enables organizations to centralize their knowledge, shorten training time, and improve compliance.


🕒 1 month ago

🏄 California – Remote

💵 $225,000 - $250,000 / year

⏰ Full-time

🔴 Expert

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

Amazon Redshift

AWS

BigQuery

Django

Kafka

Postgres

Python

RabbitMQ

Redis

SQL

Terraform



Description

• Own the data tier end-to-end
• Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases
• Review migrations for safety at scale: locks, backfills, concurrent index builds, NOT VALID constraints
• Catch N+1 patterns and missing select_related/prefetch_related in review
• Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)
• Scale review through automation, not heroics: author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge
• Capacity planning as traffic and engineering throughput grow
• Zero-downtime schema migrations and cutovers
• Multi-AZ resilience within a single region: Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs
• Backups, PITR, failover testing, retention
• Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake)
• DMS task design and tuning, replication slot hygiene on the Postgres side
• Schema evolution as Django migrations roll through, so a column rename doesn't silently break the warehouse at 6 AM
• Parquet layout and partitioning, reliability of the Snowflake handoff
• Automated checks that flag migrations likely to break downstream consumers
• Drive observability across three complementary tools: pganalyze, CloudWatch, Honeycomb

🎯 Requirements

• Deep PostgreSQL: EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum. Aurora Serverless v2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)
• Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar): predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries
• Single-region multi-AZ design: practical understanding of what it does and doesn't protect against
• Production CDC experience, ideally AWS DMS: comfortable with logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake (or BigQuery/Redshift)
• Hands-on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, Logs Insights), and Honeycomb (or another high-cardinality tracing tool); comfortable with OpenTelemetry and opinionated about what makes a trace useful
• Real experience making AI coding and review tools useful for a team: writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs
• OpenSearch at scale: sizing, sharding, JVM tuning, rolling upgrades, snapshots
• Production Redis: persistence tradeoffs, cluster mode, hot keys, thundering herds
• At least one production message broker (SQS, RabbitMQ, Kafka): delivery semantics, idempotency, failure modes
• Strong automation and IaC background: real code (Python, Go, or similar) and Terraform
• Track record leading cross-team initiatives, writing design docs that hold up, influencing without authority
• Comfortable in a high-growth environment where the right answer for 50 engineers isn't the right answer for 100
• Pragmatic outlook during incidents, focused on preventing the next one

🏖️ Benefits

• Some of the nicest and smartest teammates you’ll ever work with
• Competitive salaries
• Comprehensive healthcare benefits
• Exciting and motivating equity
• Flexible PTO
• 401(k)
• Parental leave
• Commuter benefits (SF office employees)
• WFH stipend

