Senior Engineer, Network Observability

CoreWeave

11 - 50 employees

Founded 2017

🤖 Artificial Intelligence

☁️ SaaS

💰 $100M Debt Financing on 2022-12

Artificial Intelligence • Cloud Computing • SaaS

CoreWeave is a cloud service provider that specializes in purpose-built infrastructure designed for AI workloads. Known as the AI Hyperscaler™, CoreWeave offers a range of products including GPU and CPU compute services, storage solutions, and networking services optimized for deep learning, AI model training, and rendering applications. With a robust cloud platform, CoreWeave simplifies complex infrastructure management, ensuring reliability, scalability, and high-performance computing suitable for leading AI labs and enterprises.

📋 Description

• We’re seeking a talented and experienced Senior Engineer for Network Observability to join our Network Observability team. In this role, you will be a key player in designing, developing, and maintaining the monitoring, telemetry, and observability systems that keep CoreWeave’s GPU cloud network operating reliably and at scale.
• You’ll focus on building solutions that provide real-time insights into network performance, ensuring that issues are detected proactively and resolved quickly.
• Develop, optimize, and maintain network observability platforms. Use your skills in Python and Golang to create and automate collectors, exporters, and dashboards that provide deep visibility into network health and performance.
• Collaborate with Network Engineering and Platform teams to ingest and unify logs, metrics, and events from a variety of platforms (Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, SR Linux, etc.) into a single observability pipeline.
• Design and implement scalable telemetry solutions using protocols like gNMI, SNMP, and streaming analytics. Ensure advanced alerting and anomaly detection with tools such as Prometheus, Grafana, and Alertmanager.
• Work closely with network developers, site reliability engineers, and security teams to integrate observability solutions across the broader infrastructure.
• Participate in design discussions, RFCs, and architectural decisions.
• Join a rotating on-call schedule to troubleshoot and resolve observability-related issues. Provide timely support to operations teams, quickly isolating and fixing problems when they arise.
• Guide junior team members, share best practices, and foster a culture of continuous learning and improvement within the observability domain.

🎯 Requirements

• Deep familiarity with Prometheus, Grafana, Alertmanager, gNMI, and SNMP. Experience writing or extending custom metric collectors/exporters is a plus.
• Experience as a Network Engineer, SRE, Software Developer, or Systems Administrator in large-scale environments. A track record of building and operating robust telemetry and monitoring solutions is a plus.
• Passion for automating tasks and processes. You find satisfaction in creating workflows that handle repetitive tasks and reduce human error to near zero.
• Comfortable containerizing solutions and designing, building, and deploying container-based workloads efficiently in Kubernetes.
• Proficient with Python, Go, and Bash, plus familiarity with configuration management and templating tools (e.g., Ansible, Jinja2).
• Strong knowledge of Linux systems and IP networking concepts, with hands-on experience in routing, switching, and network troubleshooting.
• Practical knowledge of a variety of platforms, including Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux.
• Collaborative, humble, and always ready to help others while staying open to learning from more senior colleagues.

🏖️ Benefits

• Family-level Medical Insurance
• Family-level Dental Insurance
• Generous Pension Contribution
• Life Assurance at 4x Salary
• Critical Illness Cover
• Employee Assistance Programme
• Tuition Reimbursement
• Work culture focused on innovative disruption
