Site Reliability Lead

🕒 May 26

Arbor Education

51 - 200 employees

📚 Education

🤝 B2B

💰 Private Equity Round on 2020-12

Education • B2B • Software

Arbor Education is a rapidly expanding company dedicated to transforming the way schools operate by freeing staff from administrative tasks and enhancing collaboration. Utilizing a Management Information System (MIS), Arbor Education provides tools to improve school processes and educational outcomes for over 5,000 schools. Their mission-driven team, consisting of ex-teachers, education technology engineers, and specialists, is passionate about providing effective solutions to improve the educational sector. Founded in 2011, Arbor Education is driven by a commitment to innovation and the well-being of teachers and students, along with a dedication to diversity and inclusion in its workforce.

📋 Description

• Define and guide system architecture, balancing trade-offs between speed, scalability, maintainability, and security to meet business goals.
• Champion accountability from design through to production by ensuring systems are observable and meet agreed Service Level Objectives (SLOs). Drive continuous improvement in platform reliability, performance, and efficiency.
• Lead Root Cause Analysis (RCA) when issues occur and contribute to optimizing the incident response process and framework.
• Drive automation initiatives across the team to reduce operational toil and improve system efficiency.
• Uphold coding standards, promote automated testing, and work with the architecture community to drive technology adoption and share best practices across teams. Ensure production readiness standards for all services.
• Lead technical estimation and feasibility assessments, ensuring plans are realistic and aligned with team capacity. Contribute to structured release planning and support post-release reviews.
• Mentor and coach engineers through constructive feedback, knowledge sharing, and motivation. Foster alignment and help the team galvanise around technical solutions and goals.
• Work closely with Product Managers, Engineering Managers, and other engineers to align technical direction with product strategy. Communicate complex technical concepts clearly to both technical and non-technical stakeholders.

🎯 Requirements

• Extensive professional experience in SRE, DevOps, or Platform Engineering on complex, scalable systems.
• Extensive expertise with AWS and distributed cloud architectures.
• Proven experience operating platforms serving a high volume of requests (~1000 req/sec).
• Advanced proficiency with Terraform and configuration management tools.
• Strong skills in Python, Go, or a similar language for automation and tooling.
• Deep experience with monitoring and observability platforms (e.g., DataDog, Prometheus, or equivalent), plus incident/problem management.
• Expert understanding of distributed systems, microservices, and resilience patterns.
• Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes, ECS).
• Practical experience with building and maintaining CI/CD pipelines for automated deployments.
• Demonstrated ability in mentoring and supporting the growth of fellow engineers.

Bonus Skills

• Experience with chaos engineering and reliability testing.
• Knowledge of security best practices and compliance frameworks.
• Background in agile and lean methodologies (Scrum/Kanban).
• Contributions to open-source projects or the SRE community.

🏖️ Benefits

• A dedicated wellbeing team who champion initiatives such as mindfulness, lunch-and-learns, manager training, mental health first aid training and much more!
• 32 days holiday (plus Bank Holidays), made up of 25 days annual leave plus 7 extra company-wide days given over Easter, Summer & Christmas
• Life Assurance paid out at 3x annual salary
• Comprehensive wellness benefit from AIG Smart Health, which includes a 24/7 virtual GP service, mental health support, counselling, and personalised health checks
• Private Dental Insurance with Bupa
• Salary sacrifice Pension provided by Scottish Widows
• Enhanced maternity and adoption pay (20 weeks full pay) and paternity pay (6 weeks full pay)
• 5 free return-to-work maternity coaching sessions, helping you adapt to this new exciting time of life!
• Access to services such as Calm and Bippit (financial wellbeing coaching)
• All of our roles champion flexible working and we are happy to discuss what this means to you
• Social committees that plan team, office and company-wide events to bring people together and celebrate success
• Dedicated professional development training budget (CPD courses, upskilling resources, professional memberships etc.)
• Volunteer with a charity of your choice for a day each year
• Dog-friendly offices!
