Senior Platform Telemetry Engineer

September 21

NVIDIA

Artificial Intelligence • Gaming • Automotive

NVIDIA is a leading technology company specializing in accelerated computing and artificial intelligence. NVIDIA pioneers advancements in graphics processing units (GPUs), cloud computing, data centers, and virtual reality, with a focus on the gaming, automotive, healthcare, and robotics industries. The company's innovations, such as NVIDIA Omniverse, transform traditional digital processes by enabling high-fidelity simulation and rendering. Its applications span many industries, from autonomous vehicles using NVIDIA DRIVE to healthcare solutions with NVIDIA Clara and AI-driven analytics and workflows.

📋 Description

• Drive next-generation fleet management solutions for scaling AI infrastructure using GPUs and Grace solutions from NVIDIA
• Work with customers, product management, and other architects to narrow down requirements for implementation
• Design architecture for a fleet health monitoring and fault-remediation solution at scale
• Work with customers and other architects to understand health monitoring requirements and leverage in-band and out-of-band capabilities
• Create detailed architecture and perform POCs to validate it
• Educate customers about product architecture and incorporate feedback
• Write architecture specs and design documents; own end-to-end delivery across teams
• Perform code reviews for code produced from architecture specs
• Ensure the product is properly tested; enhance unit testing and establish proper test plans
• Drive product life cycles with QA teams to productize code and act as product owner
• Articulate requirements in Jira and bug management tools and coordinate execution plans with managers
• Contribute to all phases of product development: definition, architecture, design, implementation, debugging, testing, and early customer support

🎯 Requirements

• BS, MS, or PhD in EE/CS or a related field (or equivalent experience)
• 5+ years of hands-on coding experience
• Strong knowledge of time series databases like InfluxDB and Prometheus
• Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
• Strong knowledge of telemetry visualization solutions like Grafana and Influx
• Strong knowledge of firmware architecture, including optimizing firmware for low-latency APIs
• Strong skill in analyzing algorithms for time and space complexity and projecting system resource requirements
• Proven record of delivering scalable solutions
• Strong and demonstrable skill in C/C++ and Python
• Programming and debugging experience on server platforms
• Experience with SCM (e.g., Git, Perforce) and project management tools like Jira
• Excellent written and oral communication skills
• Excellent work ethic, teamwork, and commitment to finishing tasks
• Self-starter with hands-on coding ability
• Ways to stand out:
  • Experience building telemetry collection and analysis engines
  • Experience with Redfish
  • Experience with notification systems like PagerDuty
  • Active OCP and DMTF contribution
  • Hands-on experience with x86 or ARM system architecture
  • Familiarity with Confidential Compute
  • Experience with ML and multi-variable optimization techniques

🏖️ Benefits

• Eligible for equity
• Benefits (unspecified)
