Senior Consultant – Site Reliability Engineering

🕒 April 29

Fabric Group

51 - 200 employees

Founded 2006

🤝 B2B

🏢 Enterprise

🤖 Artificial Intelligence

Fabric Group is a consulting and software engineering firm that helps organisations with product strategy, design thinking, custom software delivery, digital transformation, and ongoing product operations. The company combines strategy, design, and engineering expertise to consult on organisational strategy, build tailored software (including AI-based solutions), and transform and operate products for clients across transport, logistics, retail, infrastructure and government sectors. Fabric Group operates across APAC with multiple locations and a team of specialised consultants focused on driving measurable business outcomes.

📋 Description

• Consultative Ownership: Work with autonomy to own problems and deliver solutions, acting as a bridge between development and operations.
• Observability Architecture: Design and implement robust monitoring solutions using the LGTM stack to ensure system health and performance.
• Reliability Strategy: Advise clients on defining meaningful SLOs/SLIs and managing error budgets to balance innovation with stability.
• AI Assistance: Drive use of AI Agents or AI tools for intelligent automation and improving operational efficiency.
• Incident Leadership: Lead post-incident reviews (Blameless Post-Mortems) to identify systemic improvements and reduce future toil.
• Mentorship: Coach less experienced engineers within Fabric and our client teams on SRE principles and modern infrastructure patterns.
• Advising our clients on the right technical decisions and advocating for the right practices to use.
• Being an ambassador for Fabric, promoting our values and the practices we use to make sure we build the software right.
• Participate in interviewing and recruitment based on business needs.
• Thought Leadership: Contribute to the SRE community through blog posts, meetups, or internal knowledge sharing.
• Operational Support & Availability:
  • Rotational Support Coverage: Participate in a sustainable team rotation to provide extended service coverage (including weekends) for business-critical systems.
  • Incident Response: Act as a primary responder for high-priority (P1/P2) incidents during your rostered shift, focusing on rapid restoration and clear stakeholder communication.

🎯 Requirements

• Strong expertise in Observability: Deep comfort with Grafana, including the LGTM stack (Loki, Grafana, Tempo, Mimir) or Grafana Cloud, and with OpenTelemetry.
• Container Orchestration: Solid experience with Kubernetes management, configuration, and troubleshooting in production.
• Good understanding of AI Agent frameworks and tools like Grafana AI Assistant.
• Cloud Proficiency: Hands-on experience with GCP or AWS, including networking, security, and cloud-native services.
• Modern Deployment: Proven experience implementing GitOps (ArgoCD) and CI/CD pipelines (GitLab CI, GitHub Actions, etc.).
• Infrastructure as Code (IaC): Experience with tools like Terraform.
• Automation & Scripting: Proficiency in at least one language (e.g., Python, Go, or Bash) for building tooling and automating operational tasks.
• Incident Management: Experience with on-call rotation tools (Grafana OnCall, Opsgenie) and a strong commitment to a blameless culture.

🏖️ Benefits

• A variety of business domains to dive into, including retail, finance, construction and logistics
• Creating innovative custom products to solve complex problems that existing solutions just can’t
• Collaborating with a team of top-notch professionals who are obsessed with value, the latest tech and the right way to build a digital product
• Ability to switch projects every 6-12 months to keep you challenged, excited and growing
• Strong support network from the delivery community of practice, leadership and our tech teams to help you address any client challenges you may face
• Very diverse and inclusive environment where people value feedback, connections and collaboration in the workplace
• Enjoy the freedom of a fully remote lifestyle, where you can ditch the commute and deliver high-impact work from the comfort of your own home.
