
501 - 1000 employees
Founded 2014
🏢 Enterprise
☁️ SaaS
🤖 Artificial Intelligence
Grafana Labs builds open-source observability technologies and solutions. It offers a suite of tools for logs, metrics, traces, and profiles, with products such as Grafana, Loki, Tempo, and Mimir. These tools help businesses visualize, monitor, and alert on data from many sources, with capabilities such as anomaly detection, root cause analysis, and service level objective (SLO) management backed by AI/ML insights. Grafana Labs provides both cloud-based and self-managed solutions for infrastructure, application, and frontend observability. Its platform also integrates with data sources such as Prometheus and OpenTelemetry, making the company a key player in the observability and infrastructure monitoring space.
🔥 18 hours ago
🇩🇪 Germany – Remote
💵 €109.7k - €131.7k / year
⏰ Full Time
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)
• Partner closely with product engineering squads (embedded model)
• Own production reliability for high-SLA and complex customer environments
• Design and implement automation to scale our reliability practices
• Ensure our customers meet our SLO targets
• Define and evolve per-tenant SLOs and reliability models
• Proactively reduce SLO burn to prevent repeat incidents
• Serve as a primary escalation point and on-call for relevant incidents
• Lead customer-impacting incident response and post-incident reviews
• Contribute to design docs and code reviews
• Influence feature design to ensure production scalability and operability
• Build automation to eliminate toil where needed
• Improve alert quality and reduce noisy escalations

• 8+ years of engineering experience, including 4+ in SRE/CRE/production engineering
• Strong Kubernetes experience in AWS, GCP, or Azure
• Familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
• Experience operating multi-tenant systems in production
• Strong experience designing and implementing SLOs
• Experience with one or more programming languages (e.g. Go, Python, Java)
• Experience with Linux operating system internals
• Knowledge of networking, cloud storage, and scaling
• Excellent problem-solving and troubleshooting skills
• Ability to reason about performance, scaling, and failure modes
• Comfortable working within an engineering team
• Ability to partner deeply with product engineering teams
• Intellectually curious, kind, transparent by default, and with a strong bias towards action

• Equity
• Bonus (if applicable)
• 30 days annual leave including 3 Grafana Shutdown Days
• Professional development opportunities
• 100% Remote, Global Culture
• Transparent Communication
• Innovation-Driven
• Open Source Roots
• Empowered Teams
• Career Growth Pathways
• Approachable Leadership
• In-Person onboarding