Site Reliability Engineer

51 - 200 employees

Founded in 2022

🤖 Artificial Intelligence

☁️ SaaS

💰 Seed round in May 2024

Artificial Intelligence • Cloud Computing • SaaS

RunPod is a cloud platform designed to make it easier to train, fine-tune, and deploy AI models. It offers a globally distributed GPU cloud that lets users deploy their AI workloads with little effort while focusing on building machine learning applications. With features such as fast pod startup, autoscaling, and support for multiple machine learning frameworks, RunPod serves startups, academic institutions, and enterprises alike, offering a powerful and cost-effective solution for machine learning development.

🕒 27 days ago

🇺🇸 United States – Remote

💵 $150,000 - $200,000 / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Distributed Systems

Grafana

Linux

Prometheus

Python

RunPod

Description

• Increase platform uptime and reduce incident frequency and duration
• Establish and operationalize SLIs/SLOs across services
• Improve MTTR through better tooling, automation, and runbooks
• Strengthen production readiness standards
• Drive long-term systemic reliability improvements
• Define and implement SLIs/SLOs for critical services
• Lead incident response and coordinate cross-team mitigation efforts
• Conduct blameless postmortems and ensure corrective actions are completed
• Perform production readiness reviews for new services and features
• Identify systemic risks and drive preventative improvements
• Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
• Improve signal-to-noise ratio in alerts and reduce alert fatigue
• Build internal tooling for reliability tracking and reporting
• Improve visibility into GPU performance and distributed systems health
• Automate recurring operational workflows
• Build tools and scripts (Python, Go, Bash) to eliminate manual processes
• Improve deployment safety through automation and guardrails
• Strengthen CI/CD reliability and release processes
• Partner with engineering teams to improve system resilience
• Provide guidance on fault tolerance, scalability, and failure handling
• Contribute to architectural discussions with a reliability-first mindset.

🎯 Requirements

• 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
• Strong Linux systems and networking expertise
• Experience managing containerized production systems
• Strong understanding of distributed systems and failure modes
• Experience defining and managing SLIs/SLOs
• Proven incident response and postmortem leadership experience
• Strong scripting or programming skills
• Experience with monitoring and alerting systems
• Excellent written communication skills
• Successful completion of a background check.

🏖️ Benefits

• Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
• Generous medical, dental, and vision plans
• Flexible PTO: take the time you need to recharge
• Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
• Join a passionate team on the cutting edge of AI infrastructure, where culture, learning, and ownership are at the heart of how we scale.

Similar Jobs

Salesforce DevOps Evangelist

🕒 27 days ago

Flosum

201 - 500

🤝 B2B

☁️ SaaS

Salesforce DevOps Evangelist enhancing Flosum's presence through engaging content and authentic community interactions. Driving conversations and educating developers in the Salesforce ecosystem.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Senior DevOps Engineer

🕒 27 days ago

Sharetec Systems

51 - 200

💸 Finance

🏦 Banking

💳 Fintech

Senior DevOps Engineer at Sharetec responsible for deployment and configuration of mobile applications and automation. Improve operational efficiency with Ansible and Terraform in a remote role.

🇺🇸 United States – Remote

💵 $120,000 - $140,000 / year

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Android

Ansible

Docker

iOS

Kubernetes

Python

Terraform

DevOps Engineer – Senior

🕒 28 days ago

Ad Hoc LLC

501 - 1000

🏛️ Government

🤖 Artificial Intelligence

🔌 API

DevOps Engineer III supporting government digital services at Ad Hoc. Collaborating on cloud infrastructure and improving DevOps processes for federal clients.

🇺🇸 United States – Remote

💵 $115,000 - $125,000 / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Ansible

Cloud

Jenkins

Terraform

Senior DevOps Engineer

🕒 28 days ago

Ad Hoc LLC

501 - 1000

🏛️ Government

🤖 Artificial Intelligence

🔌 API

Senior DevOps Engineer at Ad Hoc supporting federal clients with scalable digital solutions. Responsible for leading DevOps strategies and mentoring team members in cloud infrastructure and CI/CD processes.

🇺🇸 United States – Remote

💵 $125,000 - $142,000 / year

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Ansible

Cloud

Terraform

Lead Cloud – DevOps Engineer

🕒 28 days ago

Blend360

501 - 1000

🤖 Artificial Intelligence

🏢 Enterprise