Site Reliability Engineering Manager

July 10


Wikimedia Foundation

Non-profit • Education • Media

Wikimedia Foundation is a nonprofit charitable organization dedicated to the growth, development, and distribution of free, multilingual content. It provides the essential infrastructure for free knowledge, including hosting Wikipedia, the free online encyclopedia that is created, edited, and verified by a global community of volunteers. Supported primarily through donations, Wikimedia Foundation promotes collaborative projects that aim to share knowledge reflecting human diversity and strives to protect everyone's right to access free and open knowledge.

501 - 1000 employees

Founded 2003

🤝 Non-profit

📚 Education

📱 Media

💰 $2.5M grant in September 2019

📋 Description

• Managing one to two globally distributed teams within Wikimedia’s Site Reliability Engineering (SRE) organization
• Providing guidance, mentorship, and support to ensure the team's effectiveness and growth
• Working with team members to set individual performance goals, and supporting them in meeting and evolving their goals and career paths
• Recruiting, hiring, and helping onboard new team members
• Triaging incoming work, maintaining focus on priorities, and setting realistic expectations for both peers and team members
• Coordinating and communicating with other members of the Wikimedia product & engineering teams on relevant projects, executing complex projects, and contributing to the organizational strategy
• Continuously developing the team's roadmap in alignment with other SRE and Product & Technology teams, and helping to draft and execute the team’s annual and quarterly plans
• Project-managing new and existing initiatives
• Leading the definition, refinement, and execution of the processes through which the team manages and performs its work
• Leading incident response, diagnosis, and follow-up on system alerts and outages across Wikimedia’s production infrastructure
• Participating in a 24/7 on-call rotation to handle escalations and help teams resolve issues
• Facilitating the definition and establishment of Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with service owners and stakeholders

🎯 Requirements

• Prior experience managing teams
• Prior hands-on experience with software or reliability engineering (within the last 3 years preferred)
• Ability to analyze complex systems, troubleshoot issues, and devise effective solutions under pressure
• Proficiency in project management methodologies to effectively plan, execute, and track new and existing initiatives
• Strong understanding of cloud computing, networking, Linux systems administration, containerization (e.g., Docker, Kubernetes), and infrastructure as code (e.g., Terraform, Ansible), sufficient to provide technical support to the team
• Aptitude for automating and streamlining tasks
• Ability to communicate effectively in both spoken and written English
• Ability to work independently while being an effective part of a globally distributed team
• Ability to travel several times a year for in-person meetings
• B.S. or M.S. in Computer Science, or equivalent experience in related work

🏖️ Benefits

• U.S. Benefits & Perks


Similar Jobs

July 9

Tekmetric

51 - 200 employees

Join Tekmetric as a Site Reliability Engineer to manage reliable cloud infrastructure and enhance system performance.

AWS

Cloud

Docker

Google Cloud Platform

Grafana

Java

JavaScript

Kubernetes

Prometheus

Python

Terraform

Go

July 8

Join Intermedia as a DevOps Engineer to deploy and maintain application infrastructure and collaborate with development teams.

Ansible

AWS

Cloud

Docker

ElasticSearch

ETL

Jenkins

Kubernetes

Linux

MySQL

Python

RabbitMQ

Redis

Go

July 6

Senior Site Reliability Engineer managing GCP infrastructure and DevOps practices. Help reduce wildfire risks using advanced technology.

Cloud

Google Cloud Platform

Kubernetes

Unix

July 4

Join Resonance as a DevOps Engineer to build and maintain an AI-driven platform for fashion.

Airflow

AWS

Cloud

Distributed Systems

Docker

EC2

Grafana

GraphQL

Jenkins

Kafka

Kubernetes

Microservices

NoSQL

Prometheus

Terraform

July 3

Join MetaRouter as a Senior Site Reliability Engineer to enhance critical infrastructure operations. Experience with cloud environments and SRE practices required.

Cloud

Docker

Google Cloud Platform

JavaScript

Kubernetes

Node.js

Prometheus

React

Terraform

Go

Built by Lior Neu-ner. I'd love to hear your feedback — Get in touch via DM or support@remoterocketship.com