Observability & Operations Engineer

🕒 March 10

Fullbay

51 - 200 employees

☁️ SaaS

🏢 Enterprise

🚗 Transport

💰 Venture Round on 2019-05

Fullbay is a comprehensive software solution designed specifically for diesel repair shops and other heavy-duty repair operations. The company offers a range of tools to streamline various aspects of shop management, including estimates and invoices, service order workflow, inventory management, and customer communication. Fullbay integrates with accounting systems for seamless bookkeeping and supports two-way texting between customers and shops. It also provides specialized repair software for fleet maintenance, mobile operations, and specific industries such as agricultural and emergency vehicles. With features like reporting, MOTOR integration for repairs, and secure cloud-based data, Fullbay aims to maximize shop efficiency and enhance profitability while contributing to safer roads.

📋 Description

• Design and implement a comprehensive observability strategy (logging, metrics, tracing, alerting) across all AWS environments, leveraging AI-powered tools to detect anomalies and surface insights automatically
• Build and manage monitoring platforms such as Datadog, Grafana, Prometheus, and AWS CloudWatch, actively exploring AI-native features within these tools to reduce alert fatigue and improve signal quality
• Use AI coding assistants (e.g. GitHub Copilot, Claude) to accelerate development of dashboards, runbooks, and automation scripts
• Own the incident management lifecycle (on-call rotations, post-mortems, root cause analysis) and apply AI-assisted log analysis to speed up diagnosis and resolution
• Instrument Java, Kotlin, and Node.js-based cloud-native applications to emit structured logs, distributed traces, and metrics; identify opportunities to use ML-based anomaly detection in place of static thresholds
• Build repeatable, code-first observability pipelines that treat dashboards, alerts, and runbooks as first-class software: versioned, tested, and deployed through Harness
• Leverage AWS PaaS services (Lambda, API Gateway, ECS, RDS, SQS, SNS, and others) to build scalable, automated operational tooling
• Collaborate with development teams to embed observability and AI-assisted quality checks into CI/CD pipelines via Harness
• Own the FinOps function for our AWS environment: tracking cloud spend, building cost dashboards, identifying waste, and using AI-powered cost analysis tools to surface optimization opportunities and drive accountability across engineering teams
• Monitor AWS infrastructure for performance, availability, and cost, partnering with finance and engineering to enforce spend governance
• Develop and maintain Infrastructure as Code using Terraform, using AI pair programming to improve quality and consistency
• Contribute to architectural decisions with a focus on resilience, automation, and reducing toil through intelligent systems
• Adhere to all confidentiality and compliance regulations
• Perform other duties as assigned

🎯 Requirements

• 7–10 years of experience in Software Engineering, Cloud Operations, or Site Reliability Engineering
• 5+ years of hands-on experience with AWS infrastructure and AWS PaaS services; certifications are a plus
• Demonstrated experience building repeatable, code-first pipelines and treating operational configuration as first-class software
• Experience working with polyglot environments including Java, Kotlin, and Node.js
• Demonstrated experience using AI tools (coding assistants, AI-powered observability platforms, or similar) in a professional setting; we’re an AI-first company and expect this to be part of how you work, not something you’re just exploring

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⚙️ Operations