
11 - 50 employees
🔌 API
Leaflet is an open-source JavaScript library for building mobile-friendly interactive maps. It is lightweight (about 42 KB), designed for simplicity, performance, and usability, and provides core mapping features such as tile layers, markers, vector layers, popups, and interaction handlers. Leaflet is highly extensible through a large plugin ecosystem, well documented, and maintained by a broad community of contributors and organizations.
🕒 3 days ago
🗣️🇺🇸🇬🇧 English required

• Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.
• Troubleshoot and resolve complex issues across production and non-production environments.
• Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.
• Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.
• Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.
• Implement and refine observability strategies that enhance visibility into application and infrastructure health.
• Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.
• Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.
• Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.
• Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.
• Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.
• Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.
• Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.
• Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.
• Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.
• Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.
• Document and share lessons learned, contributing to a culture of continuous improvement.
• Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.
• Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.
• Measure and report on toil reduction metrics to quantify the impact of automation initiatives.
• Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.
• Collaborate with DevOps and NOC teams to support the application platform.
• Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.
• Provide feedback on application performance, potential improvements, and observability metrics.
• Degree in Computer Science or a related field, or equivalent professional experience.
• 5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.
• 3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.
• Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.
• Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.
• Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.
• Proficiency in PromQL and experience with Loki for log aggregation and analysis.
• Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.
• Cloud platform expertise (AWS preferred; GCP or Azure also valued).
• Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.
• ArgoCD proficiency for GitOps workflows and continuous deployment.
• Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.
• Proven track record with on-call rotations, incident response, and root cause analysis.
• 1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.
• Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.
• Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.
• Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).
• Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.
• Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.
• Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.
• Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.
• Competitive pay and benefits
• Flexible vacation allowance
• A hybrid / remote working environment
• Startup culture backed by a secure, global brand
🕒 3 days ago
Senior Site Reliability Engineer at HavocAI responsible for reliability architecture and incident management. Ensuring performance, resilience, and operational maturity of mission-critical cloud services.
🇺🇸 United States – Remote
💵 $150,000 - $185,000 / year
💰 Seed Round in 2024-09
⏰ Full-time
🟠 Senior
⛑ DevOps and Site Reliability Engineer (SRE)
🗣️🇺🇸🇬🇧 English required
🕒 3 days ago
Senior DevOps Engineer at Ad Hoc creating scalable digital services and improving software engineering processes. Collaborating with federal agencies to enhance service delivery through technology.
🇺🇸 United States – Remote
💵 $125,000 - $140,000 / year
⏰ Full-time
🟠 Senior
⛑ DevOps and Site Reliability Engineer (SRE)
🗣️🇺🇸🇬🇧 English required
🕒 3 days ago
Senior DevSecOps Engineer at Generac managing cloud services and ensuring security and compliance in data handling. Leading efforts in secure cloud infrastructure design and integrating security in development processes.
🇺🇸 United States – Remote
💵 $145,000 - $185,000 / year
💰 €200,000,000 Grant in 2024-07
⏰ Full-time
🟠 Senior
⛑ DevOps and Site Reliability Engineer (SRE)
🦅 H1B visa sponsor
🗣️🇺🇸🇬🇧 English required
🕒 3 days ago
DevOps Engineer designing and managing cloud environments and automation tools for RethinkFirst. Delivering CI/CD pipelines, quality code, and incident management in a fast-paced environment.
🇺🇸 United States – Remote
⏰ Full-time
🟡 Mid-level
🟠 Senior
⛑ DevOps and Site Reliability Engineer (SRE)
🗣️🇺🇸🇬🇧 English required
🕒 3 days ago
Lead Site Reliability Engineer enhancing observability and telemetry platform for athenahealth's cloud infrastructure. Collaborating with engineering teams to improve reliability and operational efficiency.
🇺🇸 United States – Remote
💵 $143,000 - $243,000 / year
💰 Post-IPO Equity in 2017-05
⏰ Full-time
🟠 Senior
⛑ DevOps and Site Reliability Engineer (SRE)
🦅 H1B visa sponsor
🗣️🇺🇸🇬🇧 English required