Senior Site Reliability Engineer, Environment Automation

August 26

GitLab

Artificial Intelligence • Enterprise • SaaS

GitLab is the most comprehensive AI-powered DevSecOps platform, offering tools for automated software delivery, security, and compliance throughout the software development lifecycle. It provides solutions across areas such as AI-assisted development, continuous integration/continuous deployment (CI/CD), source code management, and vulnerability management. GitLab aims to simplify and accelerate software delivery by bringing development, security, and operations together on a single platform. It is particularly recognized for its AI code assistants and has been named a leader in the Gartner Magic Quadrant™ for DevOps Platforms, making it a preferred choice for many enterprises.

1001 - 5000 employees

Founded 2014

💰 Secondary Market on 2020-11

📋 Description

• Keep all user-facing services and production systems reliable, scalable, and efficient as a Site Reliability Engineer (SRE) at GitLab.
• Operate and automate hundreds of GitLab environments, from initial provisioning to day-to-day maintenance tasks, in the Environment Automation specialization.
• Design infrastructure automation that provisions and operates GitLab environments using Terraform, Ansible, and Kubernetes; create and maintain deployment packages such as Helm Charts and omnibus-gitlab.
• Build and operate Dedicated GitLab instances integrated with cloud-native services (e.g., GCP, AWS) and integrate with cloud provider ecosystems (IAM, networking, storage).
• Develop tools to orchestrate infrastructure-as-code workflows across multiple tenants and deploy/manage microservices on Kubernetes clusters at scale.
• Enhance the observability stack (e.g., Prometheus, ELK) for proactive monitoring and incident response; build systems to detect bottlenecks and predict usage trends.
• Lead incident response and postmortem efforts, influence architectural decisions, and collaborate with engineering teams to improve automation, resilience, and production-readiness.
• Champion and implement cloud security best practices across automated infrastructure.

🎯 Requirements

• Proven ability to operate and troubleshoot production workloads across multiple tenants or environments. Deep understanding of how distributed systems fail at scale and how to build in resilience.
• Strong hands-on experience with Terraform, including workspace strategies, state management, and automation patterns that scale. Comfortable solving state isolation issues and building reliable, reusable infrastructure code. Experience with Ansible and templating tools like Jsonnet is a plus.
• Skilled at diagnosing deployment failures, interpreting pod logs, and debugging scheduling issues and rollback scenarios in live environments. Understands how pods, ReplicaSets, and controllers interact in production.
• Ability to read and debug code in Go and/or Ruby. Familiar with identifying performance issues, scalability concerns, and contributing to infrastructure tooling through thoughtful code analysis.
• Experience supporting infrastructure for many customers or environments simultaneously. Comfortable managing isolation, scaling, monitoring, and incident response across diverse workloads.
• Able to reason through complex systems and operational challenges. Brings on-call experience and can lead technical discussions and incident resolution efforts under pressure.
• Proven ability to work across teams and with internal or external customers to solve technical problems while maintaining service commitments and clear communication.
• Comfortable using GitLab as a daily tool for infrastructure automation, collaboration, and operational workflows.

🏖️ Benefits

• Remote work
• AI as a core productivity multiplier in daily workflows
• High-performance culture driven by company values and continuous knowledge exchange
• Equal opportunity workplace and affirmative action employer
