Site Reliability Engineer

November 13

The Leaflet

API

The Leaflet is an open-source JavaScript library for building mobile-friendly interactive maps. It is lightweight (around 42 KB), designed for simplicity, performance and usability, and provides core mapping features such as tile layers, markers, vector layers, popups, and interaction handlers. Leaflet is highly extensible via a large plugin ecosystem, well-documented, and maintained by a broad community of contributors and organizations.

11 - 50 employees

📋 Description

• Maintain and improve the reliability, scalability, and performance of our Java-based application.
• Manage and monitor the applications and infrastructure.
• Use the Grafana stack (Grafana, Loki, Prometheus) to ensure a high level of observability.
• Implement robust monitoring, alerting, and logging solutions.
• Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.
• Troubleshoot and resolve complex issues in production and non-production environments.
• Participate in both pre- and post-deployment performance testing and monitoring efforts.
• Optimize Java application performance, ensuring efficient resource utilization and scaling.
• Deploy and manage the Grafana stack to provide real-time monitoring, logging, and alerting.
• Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.
• Support the operations team’s incident response efforts and participate in post-mortems.
• Document and share lessons learned from incidents.

🎯 Requirements

• Degree in computer science or a related field, or equivalent work experience
• 2-3 years in SRE, DevOps, or similar infrastructure roles
• Experience managing large-scale, high-availability production systems
• Track record of incident response and post-mortem processes
• Experience with capacity planning and performance optimization
• 1+ years of hands-on experience managing production Kubernetes clusters
• Deep understanding of Kubernetes architecture, networking, storage, and security
• Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management
• Proficiency with kubectl, Helm, and Kubernetes operators
• Container orchestration and troubleshooting knowledge
• Expertise with the Grafana stack for dashboards, alerting, and visualization
• Hands-on experience with Grafana Alloy for telemetry data collection
• Proficiency in PromQL
• Experience with Loki for log aggregation and analysis
• Experience building comprehensive monitoring and alerting strategies
• Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization
• Cloud platform expertise (AWS, GCP, or Azure)
• Familiarity with infrastructure as code (IaC) tools such as Terraform/Terragrunt or Ansible
• ArgoCD proficiency for GitOps workflows and continuous deployment
• Scripting abilities in Bash, Python, or Go
• Experience with CI/CD pipelines and automation tools
• Experience with configuration management and deployment automation
• Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks
• Proven experience in on-call rotations, incident response, and root cause analysis
• Strong written and verbal communication skills, a positive attitude, and the ability to receive constructive feedback

🏖️ Benefits

• Competitive pay and benefits.
• Start-up culture backed by a secure, global brand.
• Flexible vacation allowance.
• Internal growth and development.
