Senior Site Reliability Engineer

Airalo

51 - 200 employees

📡 Telecommunications

Telecommunications • Technology • Travel

Airalo is the world's first eSIM store, providing digital SIM cards (eSIMs) to travelers in over 200 countries and regions. Users purchase and activate eSIMs through the Airalo app, which gives them instant connectivity without a physical SIM card and helps them avoid high roaming charges. Airalo offers local, regional, and global eSIMs with transparent, prepaid pricing plans, backed by 24/7 customer service. The platform also supports partnerships through APIs and offers incentives such as referral credits.

📋 Description

• Lead the design of scalable, fault-tolerant, and self-healing systems in a multi-region AWS environment.
• Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
• Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
• Identify patterns of manual work and lead the development of internal tools and automation to eliminate them permanently.
• Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response.
• Shift from simple monitoring to deep observability, ensuring that high-cardinality data leads to proactive, actionable insights.
• Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
• Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC.
• Continuously evaluate and optimize system performance, capacity, and cost efficiency.
• Go beyond participating in on-call: refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health.

🎯 Requirements

• Bachelor’s degree in Computer Engineering or a similar discipline.
• 5+ years of experience as a Site Reliability Engineer or in a similar role.
• 3+ years of experience with AWS services, including strong knowledge of container orchestration.
• 2+ years of Kubernetes experience.
• Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
• Experience leading incident management and complex postmortem analysis.
• Experience with and interest in managing infrastructure as code (Terraform).
• Experience with chaos engineering and other techniques for testing system resilience.
• Experience with CI/CD tools such as GitHub Actions for automated delivery.
• Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling.
• Experience with event-driven architecture (SNS, SQS, etc.).
• Ability to work independently and collaboratively in a fast-paced environment.
• Team player who is open to new ideas.
• Good communication skills and fluency in English.

🏖️ Benefits

• Remote work
• Generous PTO
• Wellness allowances
• Learning allowances
• Annual Airalo Away retreat

