Site Reliability Engineer

🔥 0 minutes ago

Stack AV

51 - 200 employees

🚗 Transport

🤖 Artificial Intelligence

Stack AV is revolutionizing the transportation industry through autonomous trucking solutions driven by advanced artificial intelligence. The company develops AI-powered autonomous systems to enhance safety, reliability, and efficiency in trucking operations. It addresses the challenges of the trucking industry by designing smart solutions that improve supply chain intelligence, business outcomes, and delivery speed. Safety is a core principle, and the company uses AI, machine learning, and cloud technologies to innovate within the industry.

📋 Description

• Instrument systems scheduling and executing large-scale batch workloads across Kubernetes clusters.
• Diagnose and triage job failures for customers.
• Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
• Scale the reliability and velocity of our systems and processes through increased automation.
• Document actions to build a comprehensive library of runbooks, which will act as a knowledge base and foundation for automation.
• Participate in an on-call rotation to uphold the SLOs and SLAs of production services.
• Contribute to platform tooling, automation, and CI/CD workflows.

🎯 Requirements

• Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
• Strong experience with Kubernetes and container orchestration in production grade environments.
• Understanding of engineering design limitations and ability to provide guidance to teams to scale their services to achieve desired performance within budget.
• Strong experience implementing and debugging cloud native and open source tools such as Kubernetes, etcd, Prometheus, OpenTelemetry.
• Strong communication skills and the ability to work effectively in a diverse and distributed team.

🏖️ Benefits

• We are proud to be an equal opportunity workplace.
• We believe that diverse teams produce the best ideas and outcomes.
• We are committed to building a culture of inclusion, entrepreneurship, and innovation across gender, race, age, sexual orientation, religion, disability, and identity.

Similar Jobs

🔥 45 minutes ago

Openly

201 - 500

🏢 Enterprise

DevOps/Site Reliability Engineer II building technology infrastructure for Openly's insurance platform. Automating processes and ensuring system stability and performance.

🔥 7 hours ago

nDeavour Consulting

1 - 10

🎯 Recruiter

👥 HR Tech

🤝 B2B

Site Reliability Engineer ensuring health, performance, and delivery of infrastructure systems at Mobile Wave Solutions. Working collaboratively with engineers to automate processes and improve operational reliability.

🕒 2 days ago

Gorilla Logic

501 - 1000

☁️ SaaS

🏢 Enterprise

🤖 Artificial Intelligence

Technical Engineering Manager leading high-performing cloud and DevOps teams. Guiding architecture and delivery of scalable, reliable, and secure cloud solutions for clients.

🕒 2 days ago

Planned Systems International

1001 - 5000

🔒 Cybersecurity

🏛️ Government

DevOps Software Engineer responsible for cloud automation and software development. Supporting government research and development activities in the Health and Defense sectors.

🕒 2 days ago

Skydio

501 - 1000

🔧 Hardware

🤖 Artificial Intelligence

🔐 Security

Deployment Engineer at Skydio leading technical implementations and maintaining customer success with cloud connected products. Focused on WiFi, networking, and client communication within the Northeast region.