Site Reliability Engineer

Yesterday

CrowdStrike

Cybersecurity • SaaS • Artificial Intelligence

CrowdStrike is a cybersecurity company that provides cloud-based security services to stop breaches. It is recognized as a leader in endpoint protection, identity and cloud security, and managed detection and response. CrowdStrike's platform, Falcon, integrates artificial intelligence to offer real-time visibility, detection, and protection against sophisticated cyber threats. The company is lauded for its effectiveness in securing networks and data, making it a trusted partner for businesses worldwide.

5001 - 10000 employees

Founded 2011

📋 Description

• Ensure Platform Reliability: Own the availability, latency, performance, and efficiency of NG-SIEM platform services handling >100 PB/day of data ingestion and millions of queries per hour
• Build Automation & Tooling: Design and implement automation solutions for deployment, monitoring, incident response, and capacity planning to reduce toil and improve operational efficiency
• Monitor & Optimize: Develop comprehensive observability solutions using metrics, logs, and traces; proactively identify and resolve performance bottlenecks and reliability issues
• Incident Management: Lead incident response efforts, conduct blameless post-mortems, and drive continuous improvement initiatives to prevent recurrence
• Capacity Planning: Analyze system performance data and growth trends to forecast infrastructure needs and ensure the platform scales efficiently with customer demand
• SLO/SLA Management: Define, measure, and maintain Service Level Objectives and error budgets; balance feature velocity with reliability requirements
• Cost Optimization: Implement strategies to optimize cloud resource utilization and reduce operational costs while maintaining performance and reliability standards
• Collaborate Cross-Functionally: Partner with engineering teams to improve system design for reliability, influence architectural decisions, and embed SRE best practices
• On-Call Participation: Participate in on-call rotation to provide 24/7 support for critical production systems
• Documentation: Create and maintain runbooks, operational procedures, and technical documentation to enable team scalability

🎯 Requirements

• Experience in Site Reliability Engineering, DevOps, or similar roles supporting large-scale distributed systems in production environments
• Strong programming skills in at least one language (Go) for automation and tooling development
• Deep cloud expertise with hands-on experience in at least one major cloud platform (AWS or GCP), including compute, storage, networking, and managed services
• Distributed systems knowledge: Understanding of distributed system design patterns, consistency models, fault tolerance, and scalability principles
• Infrastructure as Code: Proficiency with IaC tools (Terraform) and configuration management (Ansible, Chef, Puppet)
• Container orchestration: Experience with Kubernetes, Docker, Podman, and container-based deployment patterns
• Observability expertise: Hands-on experience with monitoring and observability tools (Prometheus, Grafana)
• CI/CD pipelines: Experience building and maintaining continuous integration and deployment pipelines
• Incident management: Proven track record of managing high-severity incidents and implementing preventive measures
• Data-driven approach: Ability to analyze system metrics and logs to identify trends, anomalies, and optimization opportunities
• Communication skills: Excellent verbal and written communication abilities for remote collaboration across global teams

Bonus Points:

• Massive scale experience: 3+ years owning systems handling over 1 trillion requests per day or more than 10 PB of data per day
• Multi-cloud experience: Hands-on work with hybrid or multi-cloud environments
• Database expertise: Deep knowledge of distributed databases, data lakes, or SIEM platforms (ClickHouse, Redis, MySQL)
• Security background: Exposure to cybersecurity, threat intelligence, or security operations
• Networking expertise: Advanced understanding of network protocols, load balancing, and CDN technologies

🏖️ Benefits

• Remote-friendly and flexible work culture
• Market leader in compensation and equity awards
• Comprehensive physical and mental wellness programs
• Competitive vacation and holidays for recharge
• Paid parental and adoption leaves
• Professional development opportunities for all employees regardless of level or role
• Employee Networks, geographic neighborhood groups, and volunteer opportunities to build connections
• Vibrant office culture with world class amenities
• Great Place to Work Certified™ across the globe
