4 Pith papers cite this work.
citing papers explorer
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding a 2.2-3.6x inference speedup across its two fine-tuning regimes.
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
- Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection
DiSP stratifies queries by difficulty using random trial estimates, trains a router and level-specific judges, then applies budgeted stop-on-acceptance selection to improve ICL accuracy and speed on classification tasks.
- Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management