"RelayLLM: Efficient Reasoning via Collaborative Decoding":
The Problem: Deploying Large Language Models (LLMs) for complex reasoning tasks is computationally expensive, whereas Small Language Models (SLMs) are resource-efficient but often lack the necessary reasoning capabilities. Existing collaborative methods attempt to solve this by routing entire queries to either the SLM or the LLM. However, this "all-or-nothing" approach is inefficient, because the SLM is usually capable of handling the majority of the reasoning steps on its own and needs expert intervention only at specific, critical moments.
The Solution: The authors propose RelayLLM, a novel framework for token-level collaborative decoding. Instead of relying on an external router, RelayLLM turns the SLM into an active controller. During generation, the SLM can autonomously pause its own output and issue a special <call> command to invoke the LLM for a specific number of tokens. Once the LLM has generated the requested expert guidance, control returns to the SLM, which digests that information and resumes the reasoning process, effectively "relaying" the task between the two models.
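The relay loop described above can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the stand-in models (`slm_step`, `llm_step`), the `<call:N>` token format, and the choice to consume the control token rather than keep it in context are all assumptions made for the sketch.

```python
# Hypothetical sketch of RelayLLM-style token-level collaborative decoding.
# The SLM drives generation; when it emits a "<call:N>" control token, the
# LLM is invoked for N tokens, then control returns to the SLM.

def relay_decode(slm_step, llm_step, prompt, max_tokens=50):
    """slm_step / llm_step: callables mapping the current token list
    to the next token (stand-ins for real model forward passes)."""
    tokens = list(prompt)
    llm_calls = 0
    llm_tokens = 0
    while len(tokens) - len(prompt) < max_tokens:
        tok = slm_step(tokens)
        if tok == "<eos>":
            break
        if tok.startswith("<call:"):         # e.g. "<call:2>" requests 2 expert tokens
            n = int(tok[len("<call:"):-1])
            llm_calls += 1
            for _ in range(n):               # LLM supplies the requested guidance
                tokens.append(llm_step(tokens))
                llm_tokens += 1
            continue                         # control "relays" back to the SLM
        tokens.append(tok)
    return tokens[len(prompt):], llm_calls, llm_tokens


# Toy deterministic models for demonstration only.
def toy_slm(tokens):
    # The SLM handles easy steps itself and calls the expert once.
    table = {"Q": "a", "a": "<call:2>", "X2": "b", "b": "<eos>"}
    return table[tokens[-1]]

def make_toy_llm():
    count = 0
    def step(tokens):
        nonlocal count
        count += 1
        return f"X{count}"                   # expert guidance tokens X1, X2, ...
    return step

out, calls, expert_tokens = relay_decode(toy_slm, make_toy_llm(), ["Q"])
# out == ["a", "X1", "X2", "b"]; one LLM call producing two tokens
```

Note how the LLM's tokens land in the shared context, so the SLM conditions on the expert guidance when it resumes; this is the key difference from query-level routing, where one model owns the whole answer.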
Training Method: RelayLLM is trained using a two-stage framework.
Key Results: Testing across six mathematical benchmarks demonstrated that RelayLLM effectively bridges the capability gap between small and large models. For example, it improved the average accuracy of a Qwen3-1.7B model from 42.5% to 49.52%. Remarkably, this boost was achieved while invoking the LLM for only 1.07% of the total generated tokens, yielding a 98.2% cost reduction compared to a traditional query-level router at matched performance.