Senior Site Reliability Engineer

Job not on LinkedIn

🕒 3 days ago

🗣️🇺🇸🇬🇧 English required

The Leaflet

11 - 50 employees

🔌 API

Leaflet is an open-source JavaScript library for building mobile-friendly interactive maps. It is lightweight (about 42 KB), designed for simplicity, performance, and usability, and provides core mapping features such as tile layers, markers, vector layers, popups, and interaction handlers. Leaflet is highly extensible through a large plugin ecosystem, well documented, and maintained by a broad community of contributors and organizations.

Description

• Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.
• Troubleshoot and resolve complex issues across production and non-production environments.
• Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.
• Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.
• Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.
• Implement and refine observability strategies that enhance visibility into application and infrastructure health.
• Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.
• Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.
• Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.
• Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.
• Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.
• Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.
• Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.
• Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.
• Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.
• Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.
• Document and share lessons learned, contributing to a culture of continuous improvement.
• Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.
• Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.
• Measure and report on toil reduction metrics to quantify the impact of automation initiatives.
• Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.
• Collaborate with DevOps and NOC teams to support the application platform.
• Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.
• Provide feedback on application performance, potential improvements, and observability metrics.

🎯 Requirements

• Degree in Computer Science or a related field, or equivalent professional experience.
• 5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.
• 3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.
• Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.
• Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.
• Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.
• Proficiency in PromQL and experience with Loki for log aggregation and analysis.
• Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.
• Cloud platform expertise (AWS preferred; GCP or Azure also valued).
• Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.
• ArgoCD proficiency for GitOps workflows and continuous deployment.
• Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.
• Proven track record with on-call rotations, incident response, and root cause analysis.
• 1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.
• Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.
• Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.
• Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).
• Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.
• Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.
• Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.
• Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.

🏖️ Benefits

• Competitive pay and benefits
• Flexible vacation allowance
• A hybrid / remote working environment
• Startup culture backed by a secure, global brand

Similar Jobs

🕒 3 days ago

HavocAI

11 - 50

🤖 Artificial Intelligence

🔐 Security

🔧 Hardware

Senior Site Reliability Engineer at HavocAI responsible for reliability architecture and incident management. Ensuring performance, resilience, and operational maturity of mission-critical cloud services.

🇺🇸 United States – Remote

💵 $150,000 - $185,000 / year

💰 Seed round in 2024-09

⏰ Full-time

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

🕒 3 days ago

Ad Hoc LLC

501 - 1000

🏛️ Government

🤖 Artificial Intelligence

🔌 API

Senior DevOps Engineer at Ad Hoc creating scalable digital services and improving software engineering processes. Collaborating with federal agencies to enhance service delivery through technology.

🇺🇸 United States – Remote

💵 $125,000 - $140,000 / year

⏰ Full-time

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

🕒 3 days ago

Generac

5001 - 10000

⚡ Energy

🔧 Hardware

Senior DevSecOps Engineer at Generac managing cloud services and ensuring security and compliance in data handling. Leading efforts in secure cloud infrastructure design and integrating security in development processes.

🇺🇸 United States – Remote

💵 $145,000 - $185,000 / year

💰 €200,000,000 grant in 2024-07

⏰ Full-time

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🦅 H1B visa sponsor

🗣️🇺🇸🇬🇧 English required

🕒 3 days ago

RethinkFirst

51 - 200

⚕️ Health Insurance

🤖 Artificial Intelligence

📚 Education

DevOps Engineer designing and managing cloud environments and automation tools for RethinkFirst. Delivering CI/CD pipelines, quality code, and incident management in a fast-paced environment.

🇺🇸 United States – Remote

⏰ Full-time

🟡 Mid-level

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

🕒 3 days ago

athenahealth

5001 - 10000

⚕️ Health Insurance

☁️ SaaS

🤖 Artificial Intelligence

Lead Site Reliability Engineer enhancing observability and telemetry platform for athenahealth's cloud infrastructure. Collaborating with engineering teams to improve reliability and operational efficiency.

🇺🇸 United States – Remote

💵 $143,000 - $243,000 / year

💰 Post-IPO equity in 2017-05

⏰ Full-time

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🦅 H1B visa sponsor

🗣️🇺🇸🇬🇧 English required