Senior Cloud SRE

🕒 April 8

Hazelcast

51 - 200 employees

Founded 2008

🏢 Enterprise

🤖 Artificial Intelligence

☁️ SaaS

Hazelcast is a leading real-time data platform uniquely combining a fast data store and distributed compute engine into one system. It provides solutions for stream processing, event-driven architectures, and real-time AI/ML automation. The platform is well-suited for enterprise architectures and cloud-agnostic deployments, offering high performance, resilience, and scale. Hazelcast serves various industries such as financial services, e-commerce, and healthcare, helping organizations modernize their data architecture, improve payment processing, and enhance fraud detection. The company also integrates with Apache Kafka and Redis for improved data processing capabilities.

📋 Description

• Keep Hazelcast cloud-based production systems running smoothly 24/7/365
• Design and Development:
  • Design, develop, and maintain our cloud infrastructure to support both our end-user management center and microservice-based platform.
  • Implement new solutions using AWS and Terraform, improving scalability, throughput, and reliability.
  • Support and manage our Keycloak IDP, ensuring it provides appropriate security while meeting the needs of the development team.
• Security and Integration:
  • Implement security measures to protect data integrity and confidentiality, including encryption, access control, and compliance with relevant regulations.
  • Work with our operations team to maintain our SOC2 & ISO27001 compliance and keep our environment secure.
• Monitoring and Maintenance:
  • Monitor the system for performance issues, errors, and potential failures, and implement maintenance procedures such as backups, data recovery, and disaster recovery plans.
  • Troubleshoot issues related to data storage, including performance bottlenecks, data corruption, and compatibility issues with other software components.
• Collaboration:
  • Collaborate with cross-functional teams, including software developers, architects, and product managers, to ensure the effective integration and operation of the components within the overall software infrastructure.
  • Document design decisions, implementation details, and operational procedures to facilitate collaboration among team members and ensure the maintainability of the system.
• Continuous Learning:
  • Stay up to date with the latest developments in storage technologies, the Java programming language, and software engineering best practices, and apply this knowledge to improve existing storage systems and develop new solutions.
• On-call participation:
  • Be part of our on-call rotation to respond to availability incidents and work with support and engineers on customer incidents.

🎯 Requirements

• Experience with distributed systems, Kubernetes, and microservices
• Infrastructure as Code (Terraform)
• Modern DevOps stack (K8s, Prometheus, Grafana, OpenTelemetry, ArgoCD, Helm)
• Experience with at least one programming language, preferably Golang or Python
• Experience with CI and building CD pipelines (Jenkins, GitHub Actions)
• A passion for automation and keeping our software delivery fast and efficient
• Knowledge of the following is desirable:
  • Multi-cloud (AWS, GCP, and/or Azure)
  • Experience working with software engineers in designing cloud-native applications or troubleshooting them
  • Experience as part of an on-call rota
• Bachelor's degree in a relevant field of study (Computer Science or a related discipline), or equivalent experience

🏖️ Benefits

• 25 days annual leave
• Group Company Pension Plan
• Private Medical Insurance
• Private Dental Insurance
• Life Insurance
• EAP (Employee Assistance Program)
