Site Reliability Engineer

🕒 23 days ago

🇺🇸 United States – Remote

💵 $150,000 - $200,000 / year

⏰ Full-time

🟡 Mid-level

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

RunPod

51 - 200 employees

Founded 2022

🤖 Artificial Intelligence

☁️ SaaS

💰 Seed round in 2024-05

Artificial Intelligence • Cloud Computing • SaaS

RunPod is a cloud-based platform designed to simplify the training, fine-tuning, and deployment of AI models. It offers a globally distributed GPU cloud that lets users deploy their AI workloads seamlessly and focus on building machine learning applications. With features such as fast pod spin-up, auto-scaling, and support for a range of machine learning frameworks, RunPod gives startups, academic institutions, and enterprises alike a powerful, cost-effective solution for machine learning development.

Description

• Increase platform uptime and reduce incident frequency and duration
• Establish and operationalize SLIs/SLOs across services
• Improve MTTR through better tooling, automation, and runbooks
• Strengthen production readiness standards
• Drive long-term systemic reliability improvements
• Define and implement SLIs/SLOs for critical services
• Lead incident response and coordinate cross-team mitigation efforts
• Conduct blameless postmortems and ensure corrective actions are completed
• Perform production readiness reviews for new services and features
• Identify systemic risks and drive preventative improvements
• Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
• Improve signal-to-noise ratio in alerts and reduce alert fatigue
• Build internal tooling for reliability tracking and reporting
• Improve visibility into GPU performance and distributed systems health
• Automate recurring operational workflows
• Build tools and scripts (Python, Go, Bash) to eliminate manual processes
• Improve deployment safety through automation and guardrails
• Strengthen CI/CD reliability and release processes
• Partner with engineering teams to improve system resilience
• Provide guidance on fault tolerance, scalability, and failure handling
• Contribute to architectural discussions with a reliability-first mindset

🎯 Requirements

• 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
• Strong Linux systems and networking expertise
• Experience managing containerized production systems
• Strong understanding of distributed systems and failure modes
• Experience defining and managing SLIs/SLOs
• Proven incident response and postmortem leadership experience
• Strong scripting or programming skills
• Experience with monitoring and alerting systems
• Excellent written communication skills
• Successful completion of a background check

🏖️ Benefits

• Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside
• Generous medical, dental, and vision plans
• Flexible PTO: take the time you need to recharge
• Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
• Join a passionate team on the cutting edge of AI infrastructure, where culture, learning, and ownership are at the heart of how we scale

