Inference Optimization Engineer

August 8

BentoML

Artificial Intelligence • B2B • SaaS

BentoML is a flexible platform designed to deploy and manage AI/ML models and custom inference pipelines in production. It offers a unified interface for seamless deployment, scaling, and optimization of various models, including large language models (LLMs). The platform empowers users to maintain full control over their AI models by allowing deployments in any environment, whether cloud or on-premise, while ensuring security and compliance without the data ever leaving the user's infrastructure.

📋 Description

• Optimize inference in single-GPU, multi-GPU, and multi-node serving setups.
• Build repeatable tests that model production traffic; track and report results for vLLM, SGLang, TRT-LLM, and future runtimes.
• Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.
• Improve batching, caching, load balancing, and model-parallel execution.
• Write technical posts, contribute code, and present findings to the open-source community.

🎯 Requirements

• Deep understanding of transformer architecture and inference engine internals.
• Hands-on experience speeding up model serving through batching, caching, and load balancing.
• Experience with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).
• Experience with inference optimization techniques such as quantization, distillation, or speculative decoding.
• Proficiency in CUDA and in profiling tools such as Nsight, nvprof, or CUPTI. Proficiency in Triton and ROCm is a bonus.
• A track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.

🏖️ Benefits

• Competitive salary
• Equity
• Learning budget
• Paid conference travel
