Senior Site Reliability Engineer

November 5

Prolific

AI • B2B • Research

Prolific is a platform that facilitates fast and high-quality data collection by connecting researchers with participants, including AI taskers. Researchers can launch tasks, surveys, and experiments and receive responses from a global community of over 200,000 active participants within hours. Prolific prides itself on providing accurate and detailed data while ensuring that participants are fairly rewarded for their contributions. The platform is trusted by leading academics and organizations for its flexibility, simplicity, and rapid project execution.

51 - 200 employees

Founded 2014

🤝 B2B

📋 Description

• Develop and maintain highly available infrastructure using modern infrastructure-as-code techniques, with a focus on Terragrunt and Terraform.
• Manage and optimise Kubernetes clusters and their workloads, with a focus on reliability and performance.
• Participate in incident response and remediation, working with the relevant product teams and stakeholders to resolve production issues efficiently, and create and maintain runbooks.
• Review and optimise other areas of our tooling stack, such as CI/CD pipelines and release strategies.
• Foster a culture of continuous improvement, for example by enhancing documentation and upskilling teams in cloud architecture and Kubernetes.
• Improve observability and alerting across our applications and infrastructure to ensure proactive detection of system degradation.
• Collaborate with Engineering teams to foster an SRE culture, including helping to define SLOs, SLAs and error budgets.
• Design and implement automation strategies to keep managed services up to date, secure and performant.
• Lead and support automation initiatives that improve system efficiency and resilience and reduce toil.
• Organise, support and respond to on-call incidents.

🎯 Requirements

• 5+ years of experience with Google Cloud Platform, GKE and the Kubernetes ecosystem, along with experience using Terraform and Terragrunt
• Strong programming skills in Python
• Strong experience with observability principles and tooling
• Experience with GitOps workflows and platforms for Kubernetes, such as ArgoCD
• Deep understanding of system architecture and scalability principles
• Strong collaboration and communication skills for working with cross-functional teams

🏖️ Benefits

• Benefits
• External Handbook
• Website
• YouTube

Similar Jobs

November 5

Trimble Inc.

10,000+ employees

Site Reliability Engineer supporting Trimble’s Core Cloud Platform, with responsibilities spanning infrastructure as code and service reliability. Engages with cross-functional teams to enhance monitoring and incident management.

AWS • Azure • Cloud • Grafana • Jenkins • Kubernetes • Prometheus • Python • Splunk • Terraform

October 30

DevOps Team Lead at Runware, optimising infrastructure for AI delivery. Leads automation and reliability efforts in systems design and operation to improve performance.

Distributed Systems • Docker • Kubernetes • Python • Go

October 24

Site Reliability Engineer ensuring performance and scalability for the energy platform at Kraken. Collaborates with product teams to improve product performance and reliability.

October 24

Site Reliability Engineer at Circle, designing and operating blockchain infrastructure. Collaborates with teams to enhance system reliability and performance for a fast-growing platform.

Kubernetes • Python • SQL • Go

October 23

Site Reliability Engineer ensuring system reliability and performance for open-source blockchain projects at IOHK. The role involves service operations, applying engineering principles, and collaborating across projects.

AWS • Grafana • Kubernetes • Postgres • Prometheus • Python • Terraform