Senior Site Reliability Engineer

🕒 2 months ago

🇺🇸 United States – Remote

💵 $150,000 - $200,000 / year

⏰ Full-time

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🦅 H1B visa sponsor

🗣️🇺🇸🇬🇧 English required

Backblaze

201 - 500 employees

Founded 2007

🛍️ eCommerce

🏢 Enterprise

💰 €5,000,000 Series A in 2012-07

Cloud Storage • eCommerce • Enterprise

Backblaze is a cloud storage company that offers scalable, secure data backup solutions for both businesses and individuals. Its B2 Cloud Storage service provides S3-compatible object storage that lets users protect and manage their data easily with transparent pricing. Backblaze specializes in automatic, unlimited backup services for computers, giving users data security and recovery options while also supporting integration with applications for extended functionality.

Description

• Own and drive the availability, durability, and performance of critical services across all production environments.
• Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
• Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
• Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
• Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).
• Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
• Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
• Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
• Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.
• Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
• Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
• Lead capacity planning and disaster recovery strategy across critical infrastructure components.
• Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
• Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.
• Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
• Proactively identify systemic, recurring issues and architect and drive the implementation of long-term improvements and strategic design action plans.
• Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.

🎯 Requirements

• Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
• 8+ years of progressive experience in site reliability, systems engineering, or operations.
• Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
• Expert-level Linux systems administration and advanced troubleshooting skills.
• Lead security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
• Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
• Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
• Expert knowledge of incident response methodologies and operational best practices.
• Proven experience designing and operating container orchestration (Kubernetes, Docker) and microservices concepts required.
• Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.

🏖️ Benefits

• Healthcare for family, including dental and vision
• Competitive compensation and 401K
• RSU grants for full-time employees
• ESPP program
• Flexible vacation policy
• Maternity & paternity leave
• MacBook Pro to use for work, plus a generous stipend to personalize your workstation
• Childcare bonus (human children only)
• Fertility treatment and support
• Learning & development program
• Commuter benefits
• Culture that supports a healthy work-life balance
