Senior Site Reliability Engineer

🕒 March 13

ClickHouse

51 - 200 employees

Founded 2016

☁️ SaaS

🏢 Enterprise

🤖 Artificial Intelligence

ClickHouse is a fast, resource-efficient, open-source database and real-time data warehouse designed to deliver high query performance for mission-critical and time-sensitive applications. It is available as a cloud service on AWS, GCP, and Azure, with a "Bring Your Own Cloud" option and a wide range of integrations for use within diverse tech stacks. ClickHouse is used for real-time analytics, machine learning, business intelligence, and observability, in areas such as financial services, fraud detection, and gaming analytics. It supports developer-friendly SQL operations, offers cost-effective storage, and provides an open-source alternative to traditional databases. Companies such as Sony, Lyft, Cisco, GitLab, and Twilio use ClickHouse for its scalability, efficiency, and ease of use.

📋 Description

• Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems.
• Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
• Ensure that all infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place so that incidents are detected and resolved promptly.
• Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
• Continuously improve the reliability and performance of our ClickHouse services.
• Plan, enable, and drive Chaos initiatives across Engineering teams, based on internal priorities.
• Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

🎯 Requirements

• Bachelor’s or Master’s degree in Computer Science or a related field.
• At least 8 years of experience in Site Reliability Engineering or a related field.
• Hands-on experience with Go and/or Python.
• Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
• Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
• Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
• Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
• You are a strong problem solver and have solid production debugging skills.
• You are passionate about efficiency, availability, scalability, and data governance.
• You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving it forward.
• You have a high level of responsibility, ownership, and accountability.
• Excellent communication and interpersonal skills.

🏖️ Benefits

• Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries.
• Healthcare - Employer contributions towards your healthcare.
• Equity in the company - Every new team member who joins our company receives stock options.
• Time off - Flexible time off in the US, generous entitlement in other countries.
• Home office setup - $500 if you’re a remote employee.
• Global Gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites.
