Staff Machine Learning Engineer – Infrastructure

1 hour ago

Rad AI

Manufacturing • Hardware • Engineering

Rad AI is a company that specializes in custom manufacturing and processing solutions, particularly in the fabrication of components such as casters and material handling devices. They offer a range of products including adjustable casters and OEM parts designed for various industrial applications. Based in Somerset, Michigan, Rad AI is committed to high-quality engineering and manufacturing services.

51 - 200 employees

Founded 2018

🔧 Hardware

💰 $25M Series A on 2021-11

📋 Description

• Architect the infrastructure that supports our machine learning applications, services, and workflows
• Architect and maintain our ML platform that supports continuous integration, continuous delivery, and continuous training for our machine learning models
• Develop cloud-native services and serverless architectures to build scalable and resilient systems
• Partner with data scientists to design the data pipelines that enable various machine learning models in production
• Write code that meets our internal standards for security, style, maintainability, and best practices for a high-scale, HIPAA-compliant web environment
• Design, deploy, and maintain the full ML platform stack, including monitoring and observability, data analytics, backend integration with customer-facing products, and the full model R&D lifecycle
• Work with Product Management, Research, and Engineering to iterate on new features and address inefficiencies across our AI/ML infrastructure

🎯 Requirements

• 8+ years of industry experience in ML Engineering in cloud-native environments
• In-depth knowledge of Python (required), JavaScript/TypeScript (nice to have), or other modern languages in the ML domain
• Strong experience with infrastructure and DevOps tools such as Kubernetes, Docker, and Ansible
• Strong knowledge of cloud computing platforms such as AWS (preferred), GCP, and Azure
• Experience architecting distributed systems, storage systems, and databases
• Experience working with machine learning frameworks such as PyTorch and LangGraph
• Experience with Airflow (preferred) or other orchestration tools
• Experience with infrastructure-as-code tools such as Terraform (preferred), Pulumi, CloudFormation, etc.
• Experience with monitoring, tracing, and logging tools such as CloudWatch, New Relic, Grafana, etc.
• Excellent communication skills, with a strong sense of ownership and a systematic approach to problem-solving
• Proven ability to manage and lead active incidents, address what caused them, and establish systems to avoid them in the future via blameless postmortems

🏖️ Benefits

• Comprehensive Medical, Dental, Vision & Life insurance
• HSA (with employer match), FSA, & DCFSA
• 401(k)
• 11 Paid Company Holidays
• Location Flexibility (Remote-first company!)
• Flexible PTO policy
• Annual company-wide offsite
• Periodic team offsites
• Annual equipment stipend
• For roles based outside the US, your recruiter can share more details
