Senior AI Infrastructure & Platform Operations Engineer

🔥 6 minutes ago

Mirantis

501 - 1000 employees

🏢 Enterprise

☁️ SaaS

Cloud Computing • Enterprise • SaaS

Mirantis specializes in container management and cloud infrastructure solutions. Its products include Mirantis Kubernetes Engine (MKE), Mirantis OpenStack for Kubernetes (MOSK), and Mirantis Container Cloud (MCC), which provide enterprise-level Kubernetes and container management platforms. Mirantis also develops tools for secure software supply chains, such as the Mirantis Container Runtime (MCR) and Mirantis Secure Registry (MSR). As an advocate for open source technologies, Mirantis supports various projects and provides resources like Lens Desktop, a popular Kubernetes IDE, as well as technical support for enterprises adopting cloud-native technologies. Its solutions serve sectors such as public services, financial services, and the broader SaaS and technology services industries.

📋 Description

• Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
• Act as a senior escalation point for operational teams during critical service-impacting events.
• Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
• Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
• Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
• Lead root cause analysis activities and drive long-term corrective actions.
• Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
• Participate in major incident management and service restoration activities.
• Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
• Drive improvements in platform reliability, observability, monitoring, and operational processes.
• Identify opportunities to automate repetitive operational activities and improve operational efficiency.
• Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
• Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
• Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.
• Mentor and support AI Infrastructure & Platform Operations Engineers.
• Share technical knowledge through documentation, training sessions, and operational reviews.
• Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.
• Help define operational processes, escalation paths, and service reliability standards.
• Act as a trusted technical advisor during operational planning and service improvement initiatives.

🎯 Requirements

• 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles.
• Expert-level Linux administration and troubleshooting skills.
• Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues.
• Strong experience operating Kubernetes in production environments.
• Experience supporting large-scale production infrastructure and distributed systems.
• Proven experience leading technical investigations and managing complex incidents.
• Experience performing root cause analysis and driving long-term operational improvements.
• Strong understanding of observability, monitoring, and service reliability practices.
• Excellent troubleshooting and analytical skills across multiple infrastructure domains.
• Strong communication, collaboration, and stakeholder management skills.

🏖️ Benefits

• Operate some of the most advanced AI infrastructure environments in production today.
• Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
• Help define operational standards and reliability practices for next-generation AI infrastructure services.
• Influence the adoption of AI-powered operational capabilities through k0rdent AI.
• Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
• Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.
