Senior Engineer, Network Observability

11 - 50 employees

Founded 2017

🤖 Artificial Intelligence

☁️ SaaS

💰 $100M Debt Financing on 2022-12

Artificial Intelligence • Cloud Computing • SaaS

CoreWeave is a cloud service provider that specializes in purpose-built infrastructure designed for AI workloads. Known as the AI Hyperscaler™, CoreWeave offers a range of products including GPU and CPU compute services, storage solutions, and networking services optimized for deep learning, AI model training, and rendering applications. With a robust cloud platform, CoreWeave simplifies complex infrastructure management, ensuring reliability, scalability, and high-performance computing suitable for leading AI labs and enterprises.

🕒 June 4

🇬🇧 United Kingdom – Remote

⏰ Full Time

🟠 Senior

🧑‍💻 Full-stack Engineer

Ansible

Cloud

Grafana

Kubernetes

Linux

Prometheus

Python

Switching

📋 Description

• We’re seeking a talented and experienced Senior Engineer for Network Observability to join our Network Observability team. In this role, you will be a key player in designing, developing, and maintaining the monitoring, telemetry, and observability systems that keep CoreWeave’s GPU cloud network operating reliably and at scale.
• You’ll focus on building solutions that provide real-time insights into network performance, ensuring that issues are detected proactively and resolved quickly.
• Develop, optimize, and maintain network observability platforms. Use your skills in Python and Golang to create and automate collectors, exporters, and dashboards that provide deep visibility into network health and performance.
• Collaborate with Network Engineering and Platform teams to ingest and unify logs, metrics, and events from a variety of platforms (Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, SR Linux, etc.) into a single observability pipeline.
• Design and implement scalable telemetry solutions using protocols like gNMI, SNMP, and streaming analytics. Ensure advanced alerting and anomaly detection with tools such as Prometheus, Grafana, and Alertmanager.
• Work closely with network developers, site reliability engineers, and security teams to integrate observability solutions across the broader infrastructure.
• Participate in design discussions, RFCs, and architectural decisions.
• Join a rotating on-call schedule to troubleshoot and resolve observability-related issues. Provide timely support to operations teams, quickly isolating and fixing problems when they arise.
• Guide junior team members, share best practices, and foster a culture of continuous learning and improvement within the observability domain.

🎯 Requirements

• Deep familiarity with Prometheus, Grafana, Alertmanager, gNMI, and SNMP. Experience writing or extending custom metric collectors/exporters is a plus.
• Experience as a Network Engineer, SRE, Software Developer, or Systems Administrator in large-scale environments. A track record of building and operating robust telemetry and monitoring solutions is a plus.
• Passion for automating tasks and processes. You find satisfaction in creating workflows that handle repetitive tasks and reduce human error to near zero.
• Comfortable containerizing solutions for Kubernetes: designing, building, and deploying container-based workloads efficiently.
• Proficient with Python, Go, and Bash, plus familiarity with configuration management and templating tools (e.g., Ansible, Jinja2).
• Strong knowledge of Linux systems and IP networking concepts, with hands-on experience in routing, switching, and network troubleshooting.
• Practical knowledge of a variety of platforms, including Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux.
• Collaborative, humble, and always ready to help others while staying open to learning from more senior colleagues.

🏖️ Benefits

• Family-level Medical Insurance
• Family-level Dental Insurance
• Generous Pension Contribution
• Life Assurance at 4x Salary
• Critical Illness Cover
• Employee Assistance Programme
• Tuition Reimbursement
• Work culture focused on innovative disruption
