Platform Engineer, AI/ML Infrastructure

August 18

SECURENTITY

Cybersecurity • Enterprise • SaaS

SECURENTITY specializes in Identity and Access Management (IAM) solutions. It provides a suite of services that secure access to digital environments, helping organizations manage identities, enforce access control, and modernize their infrastructure. To simplify IAM challenges, SECURENTITY offers managed services, cloud IAM solutions, and expert guidance that enhance security and streamline user management.

📋 Description

• Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the life cycle of single-tenant, managed deployments.
• Are passionate about building platforms that empower developers and researchers.
• Enjoy creating elegant, automated solutions for complex infrastructure challenges in both cloud and data center environments.
• Thrive on optimizing hybrid infrastructure for performance, cost, and reliability.
• Are excited to work at the intersection of modern platform engineering and cutting-edge AI.
• Love to treat infrastructure as a product, continuously improving the developer experience.
• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
• Proven, hands-on experience building and managing production infrastructure with Terraform.
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
• Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.
• Strong scripting and automation skills (e.g., Python, Go, Bash).
• Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
• Familiarity with FinOps principles and cloud cost optimization strategies.
• Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
• Experience in a multi-region or hybrid cloud environment.

🎯 Requirements

• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
• Proven, hands-on experience building and managing production infrastructure with Terraform.
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
• Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.
• Strong scripting and automation skills (e.g., Python, Go, Bash).

🏖️ Benefits

• Offers Equity
• Offers Bonus
• 10% Annual Bonus
