DevOps Engineer – AI Inference

🔥 2 minutes ago


Gcore

201 - 500 employees

🔐 Security

🤖 Artificial Intelligence

Cloud Computing • Security • Artificial Intelligence

Gcore is a global provider of cloud, edge, and AI solutions that accelerate AI training, deliver comprehensive cloud services, enhance content delivery, and protect servers and applications. With over 180 points of presence worldwide and a network capacity of 200+ Tbps, Gcore offers secure, flexible, and scalable infrastructure services. Its integrated offerings, including Edge Cloud, Edge Network, Edge Security, and AI Infrastructure, are designed to meet the needs of businesses looking to scale and control their global infrastructure efficiently. Gcore also provides robust DDoS protection and origin shielding to ensure uninterrupted online operations, making it a trusted partner for thousands of businesses worldwide.

📋 Description

• Design, develop, and maintain infrastructure for AI inference workloads, including GPU scheduling, model deployment pipelines, and data access patterns in on-prem environments.
• Build and manage monitoring and observability tools for AI inference platforms, including dashboards, alerts, and runbooks for model health and system performance.
• Collaborate with ML engineers and platform teams to design system architecture for AI workloads, integrate inference runtimes, and test performance at scale.

🎯 Requirements

• Hands-on experience deploying, operating, and troubleshooting Kubernetes clusters, including Helm, Docker, or CRI-O.
• Strong understanding of Linux systems and networking concepts, including troubleshooting connectivity and performance issues.
• Ability to develop automation and operational tooling using Python, Go, or Bash.
• Experience provisioning and managing infrastructure with tools such as Terraform and Ansible.
• Experience designing, implementing, and maintaining CI/CD pipelines using GitLab CI or GitHub Actions.

Preferred Qualifications

• Experience operating or administering Slurm clusters.
• Experience with Cluster API (CAPI) or other Kubernetes cluster lifecycle management ("Kubeception") technologies.
• Deep understanding of Kubernetes internals, including CNI, CSI, Operators, and cluster architecture.

Nice to Have

• Experience with Kubernetes ecosystem tools such as Argo CD and Helmfile.
• Experience with Prometheus.
• Familiarity with other Cloud Native technologies.

🏖️ Benefits

• Competitive compensation
• Flexible working hours and hybrid or remote options, depending on your role
• Work from anywhere in the world for up to 45 days per year
• Private medical insurance for you and your family*
• Extra paid vacation and sick leave days*
• Support for life’s important moments and celebrations
• Language courses to help you connect and grow
• Modern, welcoming offices with snacks, drinks, and entertainment*
• Team sports and social activities*
