Dev.Pro

B2B • Fintech • SaaS

Dev.Pro is a software development partner that provides custom outsourced software development services to technology companies. With over 13 years of experience, a team of more than 900 experts, and operations in over 50 countries, Dev.Pro offers cloud development, DevOps, software testing and QA, system integration, and application security. The company serves industries including digital commerce, fintech, hospitality, and healthcare, working with both ambitious startups and Fortune 500 enterprises. Dev.Pro emphasizes quality, innovation, and a transparent collaboration process, supported by well-balanced, efficient teams.

501 - 1000 employees

Founded 2011

Senior Site Reliability Engineer

October 28

🇦🇷 Argentina – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

Kubernetes

Linux

Python

Ray

Terraform

📋 Description

• Automate deployment, scaling, and lifecycle management of GPU clusters
• Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
• Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
• Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
• Collaborate with teams to optimize performance, resources, and fault recovery at petascale

🎯 Requirements

• 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
• Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
• Strong Python or Go skills for automation and observability
• Infrastructure-as-code experience (Terraform, Ansible, Helm)
• Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
• GPU resource management knowledge (MIG, NCCL, CUDA, containers)
• Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
• Linux systems engineering, CI/CD, and configuration management skills
• Strategic thinking with strong technical and business communication
• Organization, autonomy, adaptability
• Advanced English level

**Desirable:**

• Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration
