Staff Machine Learning Systems Engineer

🕒 March 18

Reddit, Inc.

501 - 1000 employees

Founded 2005

👥 B2C

📱 Media

🌍 Social Impact

Reddit, Inc. is a social media platform that acts as a hub for thousands of communities, where users can engage in diverse conversations ranging from breaking news to niche interests. It enables users to post, comment, and vote on content, fostering a vibrant online community. Millions of people globally connect and share their passions on Reddit, creating a dynamic environment for authentic human interaction.

📋 Description

• Design end-to-end model lifecycle patterns (MLOps) to boost velocity of development for ML engineers, including data preparation, model management, experiment tracking, and more
• Zero-to-one development and support of a graph ML codebase and platform that abstracts away common patterns and enables greater model scalability and iteration
• Collaborate with ML engineers on performance tuning, including improving model training time, efficiency, and GPU training costs in a large, distributed ML training environment
• Optimize batch data processing within a data warehouse and with tools such as Apache Beam, Apache Spark, Ray Data, and more
• Architect pipelines to build and maintain massive graph data structures on the order of billions of nodes and tens of billions of edges

🎯 Requirements

• 8+ years of experience in ML infrastructure, including model training and model deployments
• Hands-on experience with ML optimization, including memory and GPU profiling
• Deep experience with cloud-based technologies for supporting an ML platform, including tools like GCP BigQuery, Google Cloud Storage, infrastructure-as-code (Terraform), and more
• Hands-on experience administering and integrating MLOps tools for experiment tracking, model serving, and model registries (e.g. MLflow or Wandb)
• Proficiency with the common programming languages and frameworks of ML, such as Python, PyTorch, and TensorFlow
• Deep experience working with distributed training frameworks, including Ray and Kubernetes
• Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.
• Strong organizational and communication skills
• Experience working with graph databases (Neo4j, JanusGraph, TigerGraph) is a big plus
• Experience working with graph neural networks (GNNs) and associated graph ML frameworks (PyTorch Geometric, Deep Graph Library) is a big plus

🏖️ Benefits

• Medical, dental, and vision insurance
• 401(k) program with employer match
• Generous time off for vacation
• Parental leave
