Staff Site Reliability Engineer

Telecommunications • Enterprise • SaaS

Aalyria is a company dedicated to creating, organizing, and managing the world's most advanced networks to enable ubiquitous connectivity at the speed of discovery. It utilizes atmospheric laser communications technology and a software platform originally developed by Alphabet. Aalyria's platform orchestrates networks across land, sea, air, space, and beyond. Key technological components include Tightbeam, a free space optics technology, and Spacetime, a software platform for network orchestration. Aalyria is backed by significant investors and has engaged in various high-profile projects, including working with NASA and developing 5G/6G networking platforms.

51 - 200 employees

November 11

🇺🇸 United States – Remote

💵 $160k - $200k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

Distributed Systems

Google Cloud Platform

Grafana

Java

Kubernetes

Prometheus

Python

Terraform

📋 Description

• Design, build, and own the technical roadmap for Aalyria's centralized observability platform, integrating and scaling tools for metrics (Prometheus), logging (Loki), and distributed tracing (Tempo/OpenTelemetry)
• Define, implement, and manage a robust framework of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for our core products, ensuring we are launch-ready
• Establish and evangelize observability best practices, providing standards, documentation, and tooling (e.g., OpenTelemetry libraries) to empower our Go and Java application teams to instrument their services effectively
• Partner with core software engineers to provide the tools and insights needed to debug performance, optimize computational pipelines (including CPU/GPU workloads), and ensure the reliability of large-scale distributed systems
• Automate the deployment, scaling, and management of the entire observability stack using Infrastructure as Code (Terraform) and GitOps principles (ArgoCD)
• Partner closely with the core infrastructure team to ensure deep visibility into our Kubernetes clusters and underlying GCP and AWS environments
• Develop and lead the company's monitoring, alerting, and incident response strategy, driving a culture of proactive reliability and blameless post-mortems

🎯 Requirements

• 7+ years of experience in an SRE or platform engineering role
• Deep, hands-on expertise building, scaling, and managing observability platforms (e.g., Prometheus, Grafana, Loki/ELK, OpenTelemetry, Tempo/Jaeger, Honeycomb)
• Strong production-level experience with Google Cloud Platform (GCP) and Kubernetes
• Proven mastery of Infrastructure as Code (IaC) with Terraform and GitOps principles (e.g., ArgoCD)
• Proficiency in a systems programming language, with a strong preference for Go and Python for debugging and writing tooling
• Demonstrable experience defining, implementing, and managing SLOs, SLIs, and error budgets for production services

🏖️ Benefits

• Competitive salary
• Comprehensive benefits (401(k), dental, vision, health, life insurance)
• Paid time off
• Equity options
• Flexible working arrangements, including hybrid remote/in-office schedules
