Senior Site Reliability Engineer

12 hours ago

BrightHire

HR Tech • AI • Recruitment

BrightHire is an interview intelligence platform that uses AI to enhance the hiring process for companies. By providing tools for interview planning, automated note-taking, and actionable insights, BrightHire helps organizations improve the quality and efficiency of their hiring efforts. The platform focuses on structured interviews, reduces bias, and supports diverse hiring practices, ultimately enabling teams to make data-driven decisions and nurture a better candidate experience.

11 - 50 employees

Founded 2019

👥 HR Tech

🎯 Recruiter

💰 $20.5M Series B (October 2021)

📋 Description

• You will own the end-to-end reliability and performance of many of our most critical systems.
• Working in lockstep with Product and Engineering, you will design, build, and refine the platform that our application and AI features run on, from Kubernetes and databases through CI/CD and observability.
• You will focus on keeping our systems fast, reliable, and easy for developers to work with.
• You will work on real infrastructure that supports features people use every day, such as:
  • Continuing to improve and iterate on our observability stack, which includes Kibana, Grafana, OpenTelemetry (OTel), and Elastic.
  • Improving database performance by analyzing slow and high-volume queries, tuning indexes, optimizing query patterns and timing, and recommending schema and code changes to keep QPS and latency low.
  • Improving and upgrading Kubernetes, including deploying new services, improving resource utilization, tightening security, and standardizing deployment patterns across teams.
  • Improving CI/CD pipelines for both backend and frontend services so engineers can ship quickly and safely, with clear feedback loops, fast build times, and reliable rollbacks.
  • Enhancing the local developer experience so that running and debugging the app locally feels fast, consistent, and representative of production.
  • Helping improve CI/CD and observability for our ML pipeline and models, bringing MLOps best practices into our existing infrastructure.

🎯 Requirements

• You have real-world experience running production systems and doing SRE, Platform, or DevOps work for web applications or APIs.
• You are comfortable working across Kubernetes, CI/CD, databases, and backend services, and you enjoy owning problems end to end.
• You have strong experience with Kubernetes in production environments, including cluster upgrades, workload deployments, scaling, and debugging.
• You have experience with observability stacks (such as Elasticsearch and Kibana, Prometheus, Grafana, or similar) and can lead efforts like upgrading Kibana to new major versions and improving logs, metrics, and dashboards.
• You have worked deeply with relational databases and SQL, know how to profile slow queries, design and tune indexes, and work with engineers to adjust query patterns, timing, and frequency to improve performance.
• You are comfortable in at least one backend language (e.g., Python) and can read and modify application code to support infra and performance improvements.
• You have experience improving CI/CD pipelines, including build and test speed, deployment workflows, and release strategies (such as blue/green or canary).
• You have worked with infrastructure-as-code tools or similar patterns to manage environments in a repeatable way.
• You think deeply about developer experience and reliability and use both metrics and empathy to guide your decisions.
• You care about security, resiliency, and cost as integral aspects of the systems you build and manage.
• You move fast and independently, but you know when to pull in teammates for pairing, reviews, or cross-team alignment.

🏖️ Benefits

• Flexible working hours
• Professional development opportunities
• Remote work options
• Strong observability
