Senior Site Reliability Engineer – SRE

🔥 0 minutes ago

QAD

1001 - 5000 employees

Founded 1979

🏢 Enterprise

☁️ SaaS

Enterprise • SaaS • Supply Chain

QAD specializes in enterprise resource planning and industrial transformation solutions. Its Adaptive Enterprise platform helps businesses optimize processes, align people with technology, and manage critical business challenges. QAD's offerings include software for manufacturing, inventory management, supply chain planning, quality management, and global trade compliance. Its solutions serve a range of industries, including automotive, consumer products, food and beverage, and industrial manufacturing. The company focuses on becoming an adaptive enterprise by integrating advanced scheduling and data-driven insights.

📋 Description

• Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver an exceptional customer experience
• Datadog Expert: Be one of the go-to experts for Datadog, responsible for defining and implementing best practices
• Software Development for Reliability: Develop robust, well-tested, and maintainable software to automate operational tasks
• Toil Reduction Champion: Identify and eliminate toil through automation and process improvements
• Incident Management & Post-Mortems: Lead blameless post-mortems and contribute to the incident response framework
• Reliability Metrics & Goals: Collaborate to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets
• Infrastructure as Code: Leverage and contribute to infrastructure as code efforts
• System Design & Architecture: Provide SRE expertise in system design reviews
• Knowledge Sharing & Mentorship: Document processes and share expertise with the team

🎯 Requirements

• Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role
• Proven ability to rapidly build accurate mental models of complex distributed systems across infrastructure, applications, networking, identity, and observability domains
• Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis
• Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions
• Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly to both technical and non-technical audiences

🏖️ Benefits

• Flexible work arrangements
• Professional development opportunities
• Continuous improvement culture
• Mentorship opportunities
