Senior Site Reliability Engineer

🕒 April 1

Prima Power

1001 - 5000 employees

🚀 Aerospace

Aerospace • Automotive • Manufacturing

Prima Power is a leading provider of high-performance machines and automated solutions designed to enhance productivity in sheet metal working. They specialize in advanced technologies such as laser cutting, punching, and robotic solutions, providing a modular approach to manufacturing that integrates seamlessly into clients' production processes. With a customer-focused philosophy, Prima Power aims to support businesses across various industries by improving production efficiency and capabilities.

📋 Description

• Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs, working directly on production infrastructure, and collaborating closely with software engineers on system design and reliability improvements
• Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR, participate in and lead incident response, and drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
• Continuously analyze and optimize system performance and cost, provide data, insights, and recommendations to inform capacity planning, and support security best practices through hands-on vulnerability remediation and threat mitigation

🎯 Requirements

• SRE & Cloud Engineering: Hands-on experience with SRE practices in production, strong AWS expertise, Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
• Automation & Software Engineering: Strong software engineering fundamentals with a consistent focus on clean, well-structured, and maintainable code, including solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging)
• Reliability, Data & Operations: Lead incident response and RCAs, improve system reliability, and engage stakeholders to propose solutions, share learnings, and mentor others
• Nice-to-Have: Experience operating in highly regulated industries (e.g. Insurance, Banking, Healthcare), managing sensitive data, and supporting secure networking setups, including exposure to security technologies such as Cloudflare
• Strong understanding of microservices architectures, their principles and trade-offs, with the ability to troubleshoot and maintain distributed systems and supporting technologies (RabbitMQ, Kafka, PostgreSQL, Redis)
• Hands-on experience with Datadog for platform and application monitoring and performance optimisation, plus solid fundamentals in database structures and operational troubleshooting
• Hands-on experience with PySpark and familiarity with MLOps practices, including model registries, versioning, retraining workflows, and deployment lifecycles

🏖️ Benefits

• Work Your Way: Enjoy full flexibility – work from home, the office or a mix of both. Plus, work from anywhere for up to 30 days a year.
• Grow with us: Get access to learning resources, mentorship and a growth plan tailored to you.
• Thrive and perform: Enjoy private healthcare, gym discounts, wellbeing programs and mental health support.

Similar Jobs

🕒 April 1

Fortyx

1 - 10

Site Reliability Engineer optimizing reliability, scalability, and performance for Luupli's AWS cloud infrastructure. Collaborating with teams to enhance automation and incident management.

AWS

Cloud

EC2

Python

Terraform

🕒 March 31

RemoteStar

11 - 50

🤝 B2B

🎯 Recruiter

☁️ SaaS

Senior Site Reliability Engineer Manager ensuring infrastructure and service reliability. Leading SRE team and driving operational excellence in a B2B diamond marketplace.

AWS

Azure

Cloud

Google Cloud Platform

Grafana

Prometheus

Python

Go

🕒 March 31

Keywords Studios

10,000+ employees

🎮 Gaming

📱 Media

🤖 Artificial Intelligence

Azure DevOps Engineer supporting Azure services for Keywords Group in the global Video Game Industry. Managing cloud solutions and leading projects in a remote environment.

AWS

Azure

Cloud

SQL

🕒 March 31

Whitespace Software

51 - 200

🔌 API

💸 Finance

Senior DevOps Engineer at WhiteSpace Technology managing cloud provisioning and high availability. Collaborating with developers and implementing CI/CD while ensuring system hardening and security.

Ansible

Cloud

Grafana

Prometheus

Python

🕒 March 31

Tartan Social

1 - 10

🤝 B2B

🛍️ eCommerce

DevOps Engineer responsible for building and maintaining PlaidCloud on Kubernetes with automation processes and deployment strategies. Ensures high availability and efficient resource usage for customer deployments.

Apache

Cloud

Firewalls

Greenplum

Jenkins

Kubernetes

Linux

Python

RabbitMQ

Redis

Unix