Senior Site Reliability Engineer

October 28

Dev.Pro

B2B • Fintech • SaaS

Dev.Pro is a software development partner that supports technology companies with custom outsourced software development services. With over 13 years of experience, a team of more than 900 experts, and operations in over 50 countries, Dev.Pro provides cloud development, DevOps, software testing and QA, system integration, and application security services. The company serves industries such as digital commerce, fintech, hospitality, and healthcare with tailored software development. Dev.Pro emphasizes quality, innovation, and a transparent collaboration process in its work with both ambitious startups and Fortune 500 enterprises.

501 - 1000 employees

Founded 2011

📋 Description

• Automate deployment, scaling, and lifecycle management of GPU clusters
• Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
• Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
• Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
• Collaborate with teams to optimize performance, resources, and fault recovery at petascale

🎯 Requirements

• 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
• Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
• Strong Python or Go skills for automation and observability
• Infrastructure-as-code experience (Terraform, Ansible, Helm)
• Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
• GPU resource management knowledge (MIG, NCCL, CUDA, containers)
• Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
• Linux systems engineering, CI/CD, and configuration management skills
• Strategic thinking with strong technical and business communication
• Organization, autonomy, adaptability
• Advanced English level

**Desirable:**

• Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration
