Site Reliability Engineer

October 16

Tecsys Inc.

Healthcare • SaaS • Logistics

Tecsys Inc. is a leading provider of supply chain management software and services designed to streamline operations in various industries. Known for its expertise in Warehouse Management Systems (WMS), Tecsys serves sectors including healthcare, distribution, 3PL, retail, and e-commerce. The company's Elite and Omni platforms offer comprehensive solutions for inventory management, transportation management, and order fulfillment. With a focus on healthcare supply chain integration, Tecsys helps organizations achieve high efficiency, cost savings, and improved patient care through innovative technology solutions.

501 - 1000 employees

Founded 1983

☁️ SaaS

📋 Description

• Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
• Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
• Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
• Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
• Be on-call.
• Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
• Implement monitoring, logging, alerting, and SLA reporting.
• Create and maintain technical documentation.
• Implement, maintain, and mature SRE best practices.
• Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
• Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
• Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
• Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining a high level of performance, availability, and reliability for our users.

🎯 Requirements

• 5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
• Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
• Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
• Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
• Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
• Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
• Experience with incident management, on-call participation, escalation, and structured postmortems.
• Scripting skills in Python, Bash, Java, or equivalent for automation and diagnostics.
• Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.
• Experience with FedRAMP (the Federal Risk and Authorization Management Program) compliance is a strong asset.
• Basic knowledge of Java- or .NET-based development required.
• Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners, and colleagues beyond the province of Quebec.

**Additional requirements:**

• Escalation on-call rotation
• Occasional travel (quarterly offsites, conferences; less than 10%)
