"RelayLLM: Efficient Reasoning via Collaborative Decoding":

The Problem: Deploying Large Language Models (LLMs) for complex reasoning tasks is computationally expensive, while Small Language Models (SLMs) are resource-efficient but often lack the necessary reasoning capability. Existing collaborative methods address this by routing entire queries to either the SLM or the LLM. This "all-or-nothing" approach is inefficient, however, because the SLM can usually handle the majority of the reasoning steps on its own and needs expert intervention only at specific, critical moments.

The Solution: The authors propose RelayLLM, a novel framework for token-level collaborative decoding. Instead of relying on an external router, RelayLLM turns the SLM into an active controller. During generation, the SLM can autonomously pause its own output and issue a special <call> command to invoke the LLM for a specific number of tokens. Once the LLM has produced the requested expert guidance, control returns to the SLM, which digests the information and resumes reasoning, effectively "relaying" the task between the two models.
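To make the mechanism concrete, here is a minimal sketch of what such a relay decoding loop might look like. The helper names (`slm_step`, `llm_generate`), the exact `<call>N</call>` syntax, and the stopping logic are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of token-level relay decoding (not the paper's code).
# Assumption: the SLM signals a handoff by emitting "<call>N</call>",
# requesting N expert tokens from the LLM before resuming on its own.

import re

CALL_PATTERN = re.compile(r"<call>(\d+)</call>$")

def relay_decode(prompt, slm_step, llm_generate, max_tokens=512):
    """Generate a response, letting the SLM hand off to the LLM on demand.

    slm_step(context)        -> next token proposed by the small model
    llm_generate(context, n) -> string of n expert tokens from the large model
    """
    context = prompt
    generated = []
    while len(generated) < max_tokens:
        token = slm_step(context)
        context += token
        generated.append(token)

        # If the SLM just completed a <call>N</call> command, relay to the LLM.
        match = CALL_PATTERN.search(context)
        if match:
            n_expert_tokens = int(match.group(1))
            expert_span = llm_generate(context, n_expert_tokens)
            context += expert_span          # splice expert guidance into the context
            generated.append(expert_span)   # control then returns to the SLM

        if token == "<eos>":                # the SLM decides when the answer is finished
            break
    return "".join(generated)
```

The key design point is that the router disappears: the decision of when (and for how long) to consult the expert lives inside the SLM's own token stream.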

Training Method: RelayLLM is trained using a two-stage framework:

  1. Supervised Warm-up: The SLM is first taught the basic syntax and structure of the <call> command on synthetic data, to prevent distribution shift.
  2. Reinforcement Learning: The model is then fine-tuned with Group Relative Policy Optimization (GRPO) and a custom "difficulty-aware reward". This reward teaches the SLM when to strategically seek help: it rewards independent success, penalizes stubbornness (failing to ask for help on hard queries), and incentivizes exploration on highly complex problems (a sketch of such a reward follows this list).
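Below is a hedged sketch of how a difficulty-aware reward of this kind could be scored per rollout. The difficulty buckets, the specific reward values, and the idea of estimating difficulty from the SLM's own solve rate are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative difficulty-aware reward for one rollout (values are assumptions,
# not the paper's reward table).

def difficulty_aware_reward(is_correct, used_call, difficulty):
    """Score a single sampled completion.

    is_correct : whether the final answer is right
    used_call  : whether the SLM invoked the LLM via <call> during generation
    difficulty : "easy" | "hard" | "very_hard", e.g. estimated from the SLM's
                 own solve rate on the query (an assumed proxy)
    """
    if is_correct and not used_call:
        return 1.0          # reward fully independent success
    if is_correct and used_call:
        return 0.8          # correct with help is good, but slightly discounted
    if not is_correct and not used_call and difficulty in ("hard", "very_hard"):
        return -0.5         # penalize "stubbornness": failing without asking for help
    if not is_correct and used_call and difficulty == "very_hard":
        return 0.1          # small bonus keeps exploration of <call> alive on the hardest queries
    return 0.0              # otherwise neutral
```

In a GRPO setup, per-rollout rewards like these would be normalized within each group of sampled completions for the same query to form relative advantages.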

Key Results: Testing across six mathematical benchmarks demonstrated that RelayLLM effectively bridges the capability gap between small and large models. For example, it improved the average accuracy of a Qwen3-1.7B model from 42.5% to 49.52%. Remarkably, this gain was achieved while invoking the LLM for only 1.07% of the total generated tokens, amounting to a 98.2% cost reduction compared to a traditional query-level router with matched performance.