The Leaflet

11 - 50 employees

💼 Consulting

⚖️ Legal

🔌 API

The Leaflet is an open-source JavaScript library for building mobile-friendly interactive maps. It is lightweight (around 42 KB), designed for simplicity, performance and usability, and provides core mapping features such as tile layers, markers, vector layers, popups, and interaction handlers. Leaflet is highly extensible via a large plugin ecosystem, well-documented, and maintained by a broad community of contributors and organizations.

Senior Site Reliability Engineer

🕒 June 10

🐊 Florida – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

AWS

Azure

Cloud

Google Cloud Platform

Grafana

Java

Kubernetes

Prometheus

Python

Terraform

📋 Description

• Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.
• Troubleshoot and resolve complex issues across production and non-production environments.
• Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.
• Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.
• Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.
• Implement and refine observability strategies that enhance visibility into application and infrastructure health.
• Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.
• Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.
• Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.
• Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.
• Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.
• Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.
• Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.
• Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.
• Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.
• Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.
• Document and share lessons learned, contributing to a culture of continuous improvement.
• Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.
• Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.
• Measure and report on toil reduction metrics to quantify the impact of automation initiatives.
• Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.
• Collaborate with DevOps and NOC teams to support the application platform.
• Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.
• Provide feedback on application performance, potential improvements, and observability metrics.

🎯 Requirements

• Degree in Computer Science or a related field, or equivalent professional experience.
• 5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.
• 3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.
• Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.
• Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.
• Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.
• Proficiency in PromQL and experience with Loki for log aggregation and analysis.
• Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.
• Cloud platform expertise (AWS preferred; GCP or Azure also valued).
• Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.
• ArgoCD proficiency for GitOps workflows and continuous deployment.
• Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.
• Proven track record with on-call rotations, incident response, and root cause analysis.
• 1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.
• Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.
• Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.
• Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).
• Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.
• Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.
• Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.
• Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.

🏖️ Benefits

• Competitive pay and benefits
• Flexible vacation allowance
• A hybrid / remote working environment
• Startup culture backed by a secure, global brand

Similar Jobs

GTM DevOps Engineer

🕒 June 9

ClickUp

1001 - 5000

☁️ SaaS

⚡ Productivity

🏢 Enterprise

GTM DevOps Engineer at ClickUp responsible for reliability and automation of Go-To-Market technology stack. Collaborating with developers to build CI/CD pipelines and manage cloud infrastructure.

🇺🇸 United States – Remote

💵 $160k - $210k / year

💰 $400M Series C - ClickUp on 2021-10

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Cloud

ERP

Google Cloud Platform

Jenkins

DevSecOps Engineer

🕒 June 9

DMI (Digital Management, LLC)

1001 - 5000

💼 Consulting

🏥 Healthcare

📦 Logistics

Mid-level DevSecOps Engineer supporting hybrid cloud infrastructure for federal agency client. Focus on automation, security, and CI/CD practices.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

AWS

Cloud

Docker

Kubernetes

Python

Terraform

Vault

Cloud DevOps Engineer – WAF, Network Management

🕒 June 9

OneStream Software

1001 - 5000

💸 Finance

🏢 Enterprise

Cloud DevOps Engineer responsible for designing and supporting cloud infrastructure at OneStream. Managing automation and optimizing cloud environments within SaaS offerings.

🇺🇸 United States – Remote

💵 $99k - $128.5k / year

💰 Series B on 2021-04

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Cloud

Google Cloud Platform

Grafana

Kubernetes

MS SQL Server

OpenShift

Prometheus

SQL

Terraform
