Senior AI Infrastructure, Platform Operations Engineer

🔥 3 minutes ago

🇪🇺 Europe – Remote

⏰ Full-time

🟠 Senior

👷 IT Infrastructure Engineer

🗣️🇺🇸🇬🇧 English required

Mirantis

501 - 1000 employees

🏢 Enterprise

☁️ SaaS

Cloud Computing • Enterprise • SaaS

Mirantis is a company specializing in container management and cloud infrastructure solutions. Its portfolio includes Mirantis Kubernetes Engine (MKE), Mirantis OpenStack for Kubernetes (MOSK), and Mirantis Container Cloud (MCC), which are enterprise-grade platforms for Kubernetes and container management. Mirantis also develops tools for secure software supply chains, such as Mirantis Container Runtime (MCR) and Mirantis Secure Registry (MSR). As an advocate of open-source technologies, Mirantis supports various projects and provides resources such as Lens Desktop, a popular Kubernetes IDE, as well as technical support for companies adopting cloud-native technologies. Its solutions are aimed at sectors such as the public sector, financial services, and SaaS and technology services.

Description

• Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
• Act as a senior escalation point for operational teams during critical service-impacting events.
• Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
• Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
• Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
• Lead root cause analysis activities and drive long-term corrective actions.
• Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
• Participate in major incident management and service restoration activities.
• Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
• Drive improvements in platform reliability, observability, monitoring, and operational processes.
• Identify opportunities to automate repetitive operational activities and improve operational efficiency.
• Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
• Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
• Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.
• Mentor and support AI Infrastructure & Platform Operations Engineers.
• Share technical knowledge through documentation, training sessions, and operational reviews.
• Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.
• Help define operational processes, escalation paths, and service reliability standards.
• Act as a trusted technical advisor during operational planning and service improvement initiatives.

🎯 Anforderungen

‱ 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles. ‱ Expert-level Linux administration and troubleshooting skills. ‱ Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues. ‱ Strong experience operating Kubernetes in production environments. ‱ Experience supporting large-scale production infrastructure and distributed systems. ‱ Proven experience leading technical investigations and managing complex incidents. ‱ Experience performing root cause analysis and driving long-term operational improvements. ‱ Strong understanding of observability, monitoring, and service reliability practices. ‱ Excellent troubleshooting and analytical skills across multiple infrastructure domains. ‱ Strong communication, collaboration, and stakeholder management skills. ‱ Experience in one or more of the following areas is highly desirable: NVIDIA GPU infrastructure and accelerated computing platforms, InfiniBand networking and NVIDIA UFM, AI infrastructure environments, HPC environments, Platform Engineering or Site Reliability Engineering (SRE), Large-scale Kubernetes operations, Infrastructure automation technologies and Infrastructure-as-Code practices, Observability platforms such as Grafana, Prometheus, ELK, or OpenTelemetry, Performance analysis and optimisation of distributed infrastructure platforms, Technical leadership, mentoring, or team lead responsibilities.

🏖️ Benefits

• Operate some of the most advanced AI infrastructure environments in production today.
• Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
• Help define operational standards and reliability practices for next-generation AI infrastructure services.
• Influence the adoption of AI-powered operational capabilities through k0rdent AI.
• Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
• Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.

