Senior AI Infrastructure – Platform Operations Engineer

501 - 1000 employees

🏢 Enterprise

☁️ SaaS

Cloud Computing • Enterprise • SaaS

Mirantis specializes in container management and cloud infrastructure solutions. Its portfolio includes Mirantis Kubernetes Engine (MKE), Mirantis OpenStack for Kubernetes (MOSK), and Mirantis Container Cloud (MCC), enterprise-grade platforms for Kubernetes and container management. Mirantis also develops tools for secure software supply chains, such as Mirantis Container Runtime (MCR) and Mirantis Secure Registry (MSR). An advocate of open-source technologies, Mirantis supports a range of projects and provides resources such as Lens Desktop, a popular Kubernetes IDE, along with technical support for companies adopting cloud-native technologies. Its solutions target the public sector, financial services, and SaaS and technology services.

🔥 10 minutes ago

🇪🇺 Europe – Remote

⏰ Full-time

🟠 Senior

🏗️ Platform Engineer

🗣️🇺🇸🇬🇧 English required

Cloud

Distributed Systems

Grafana

Kubernetes

Linux

Prometheus

Description

• Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
• Act as a senior escalation point for operational teams during critical service-impacting events.
• Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
• Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
• Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
• Lead root cause analysis activities and drive long-term corrective actions.
• Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
• Participate in major incident management and service restoration activities.
• Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
• Drive improvements in platform reliability, observability, monitoring, and operational processes.
• Identify opportunities to automate repetitive operational activities and improve operational efficiency.
• Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
• Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
• Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.
• Mentor and support AI Infrastructure & Platform Operations Engineers.
• Share technical knowledge through documentation, training sessions, and operational reviews.
• Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.
• Help define operational processes, escalation paths, and service reliability standards.
• Act as a trusted technical advisor during operational planning and service improvement initiatives.

🎯 Requirements

• 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles.
• Expert-level Linux administration and troubleshooting skills.
• Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues.
• Strong experience operating Kubernetes in production environments.
• Experience supporting large-scale production infrastructure and distributed systems.
• Proven experience leading technical investigations and managing complex incidents.
• Experience performing root cause analysis and driving long-term operational improvements.
• Strong understanding of observability, monitoring, and service reliability practices.
• Excellent troubleshooting and analytical skills across multiple infrastructure domains.
• Strong communication, collaboration, and stakeholder management skills.
• Experience in one or more of the following areas is highly desirable:
  • NVIDIA GPU infrastructure and accelerated computing platforms.
  • InfiniBand networking and NVIDIA UFM.
  • AI infrastructure environments.
  • HPC environments.
  • Platform Engineering or Site Reliability Engineering (SRE).
  • Large-scale Kubernetes operations.
  • Infrastructure automation technologies and Infrastructure-as-Code practices.
  • Observability platforms such as Grafana, Prometheus, ELK, or OpenTelemetry.
  • Performance analysis and optimisation of distributed infrastructure platforms.
  • Technical leadership, mentoring, or team lead responsibilities.

🏖️ Benefits

• Operate some of the most advanced AI infrastructure environments in production today.
• Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
• Help define operational standards and reliability practices for next-generation AI infrastructure services.
• Influence the adoption of AI-powered operational capabilities through k0rdent AI.
• Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
• Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.
