Staff Software Engineer – Databases SRE

501 - 1000 employees

Founded 2014

🏢 Enterprise

☁️ SaaS

🤖 Artificial Intelligence

Grafana Labs specializes in open-source observability technologies and solutions. Its suite covers logging, metrics, tracing, and profiling, with products such as Grafana, Loki, Tempo, and Mimir. These tools help businesses visualize, monitor, and alert on data from many sources, and they use AI/ML insights to provide capabilities such as anomaly detection, root cause analysis, and service level objective (SLO) management. Grafana Labs offers both cloud-based and self-managed solutions for infrastructure, application, and frontend observability. The platform also integrates with data sources such as Prometheus and OpenTelemetry, which makes the company a key player in observability and infrastructure monitoring.

🇬🇧 United Kingdom – Remote

💵 £104k - £124.8k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🇬🇧 UK Skilled Worker Visa Sponsor

AWS

Azure

Cloud

Google Cloud Platform

Grafana

Java

Kubernetes

Linux

Python

Terraform

📋 Description

• Support the highest value Grafana Cloud customers by ensuring database reliability
• Partner closely with product engineering squads
• Own production reliability for high-SLA customer environments
• Design and implement automation for reliability practices
• Ensure customers meet SLO targets
• Lead incident response and reviews
• Contribute to design docs and code reviews
• Build automation to eliminate toil
• Improve alert quality and reduce escalations

🎯 Requirements

• 8+ years engineering experience, 4+ in SRE/CRE/production engineering
• Strong Kubernetes experience in AWS, GCP, or Azure
• Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
• Strong technical leadership experience
• Experience operating multi-tenant systems in production
• Strong experience designing and implementing SLOs
• Experience with one or more programming languages (e.g. Go, Python, Java)
• Experience with Linux internals
• Excellent problem-solving skills
• Experience in incident response & post-incident reviews
• Ability to reason about performance, scaling, and failure modes
• Comfort with autonomous work within an engineering team

🏖️ Benefits

• Equity
• Bonus (if applicable)
• 30 days of annual leave
• Grafana Shutdown Days
• In-Person onboarding

Similar Jobs

Staff SRE, Ads

🕒 June 19

Reddit, Inc.

501 - 1000

👥 B2C

📱 Media

🌍 Social Impact

Staff Site Reliability Engineer leading reliability initiatives across Ads domains at Reddit. Working to improve reliability, scalability, and operational efficiency in Reddit's advertising ecosystem.

🇬🇧 United Kingdom – Remote

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

Distributed Systems

Linux

Python

DevOps Reliability Engineer

🕒 June 11

Advanced Solutions International, Inc.

201 - 500

🤝 B2B

🤝 Non-profit

DevOps Reliability Engineer ensuring performance, scalability, and reliability of Azure-based SaaS platform at ASI. Collaborating with engineering teams to improve system efficiency and resilience.

🇬🇧 United Kingdom – Remote

💰 Venture Round on 2022-01

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

Azure

Cloud

SQL

Principal DevOps Engineer

🕒 May 26

Intermedia Cloud Communications

1001 - 5000

🤝 B2B

🏢 Enterprise

☁️ SaaS

Principal DevOps Engineer serving as technical lead and architect for infrastructure, automation, and deployments in cloud communications provider. Focused on reliability, standards, and cross-platform initiatives.

🇬🇧 United Kingdom – Remote

💰 Venture Round on 2017-02

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

DNS

Docker

Kubernetes

Linux

NGINX

Prometheus

RabbitMQ

Redis

Terraform

Staff Site Reliability Engineer – Site Experience

🕒 May 25

Reddit, Inc.

501 - 1000

👥 B2C

📱 Media

🌍 Social Impact

Staff Site Reliability Engineer leading reliability initiatives for critical user facing systems at Reddit. Driving operational excellence and performance for large-scale distributed systems.

🇬🇧 United Kingdom – Remote

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

Distributed Systems

Linux

Python

Principal Platform Infrastructure Engineer – SRE Enablement

🕒 May 12

Menlo Security Inc.

201 - 500

🔒 Cybersecurity

🏢 Enterprise

Principal Platform Infrastructure Engineer designing and operating Menlo Security's infrastructure platform across multiple environments. Collaborating with global teams and leveraging cloud-native technologies like Google Kubernetes Engine and Terraform.

🇬🇧 United Kingdom – Remote

💰 $100M Series E on 2020-11

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

DNS

Google Cloud Platform

Grafana

Kubernetes

Prometheus

Python

TCP/IP

Terraform