Large language model inference acceleration: A comprehensive hardware perspective
5 Pith papers cite this work.
Representative citing papers:
- Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
  SPEX delivers a 1.2-3x speedup on Tree-of-Thought (ToT) algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding (a speculative-search sketch follows this list).
- A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
  Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance across diverse architectures while training only 0.5-40% of the parameters (see the adapter sketch after this list).
- Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling
  A new end-to-end modeling approach for latency-sensitive many-core architectures with a globally shared L1 scratchpad memory (SPM) tracks RTL golden models within 7% error while running up to 115x faster, and it supports profiling for design optimization (see the bank-contention sketch after this list).
- Secure eFPGA-Enabled Edge LLM Inference: Architectural and Hardware Countermeasures
  A hybrid ASIC+eFPGA architecture is proposed to add adaptive security mechanisms to edge LLM inference while retaining ASIC efficiency.
- Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
  The paper surveys hardware-software co-design techniques, including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators, to speed up multimodal foundation models, with examples from medical and code tasks (see the quantization sketch after this list).
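The SPEX entry names three mechanisms but reproduces none of the paper's interfaces, so the Python sketch below is a hypothetical illustration of speculative Tree-of-Thought search: a cheap draft evaluator pre-ranks candidate thoughts and the expensive reward model is invoked only to verify promising paths. The function names, thresholds, and budget policy are all assumptions, not the SPEX algorithm itself.

```python
import heapq
from itertools import count

def speculative_tot_search(root, propose, cheap_score, expensive_score,
                           total_budget=64, max_nodes=256,
                           verify_margin=0.1, stop_score=0.9):
    """Hypothetical speculative Tree-of-Thought search (not the SPEX algorithm).

    propose(node, width)  -> list of candidate child thoughts (an LLM call)
    cheap_score(node)     -> fast draft estimate of path quality in [0, 1]
    expensive_score(node) -> costly reward-model evaluation in [0, 1]
    """
    tie = count()  # tie-breaker so the heap never compares node objects
    frontier = [(-cheap_score(root), next(tie), root)]
    best_score, best_node = float("-inf"), root
    budget = total_budget  # counts expensive reward-model calls only

    for _ in range(max_nodes):
        if not frontier or budget <= 0:
            break
        neg_draft, _, node = heapq.heappop(frontier)
        draft = -neg_draft

        # Speculative path selection: trust the cheap draft estimate and
        # pay for reward-model verification only when a path looks decisive.
        if draft >= stop_score - verify_margin:
            score = expensive_score(node)
            budget -= 1
            if score >= stop_score:
                return node  # adaptive early termination on a verified win
        else:
            score = draft

        if score > best_score:
            best_score, best_node = score, node

        # Dynamic budget allocation: branch wide while verification budget
        # is plentiful, fall back to near-greedy expansion as it runs out.
        width = 3 if budget > total_budget // 2 else 1
        for child in propose(node, width):
            heapq.heappush(frontier, (-cheap_score(child), next(tie), child))

    return best_node
```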
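The random-scaffold result rests on the standard LoRA pattern: freeze the backbone weights and train only a low-rank update. The PyTorch sketch below shows that pattern on a single linear layer; the layer size, rank, and scaling are illustrative choices, not the paper's configuration. At this size the trainable fraction comes out around 3%, inside the 0.5-40% range quoted above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen, randomly initialized linear layer with a trainable
    rank-r LoRA update. Sizes, rank, and alpha are illustrative."""

    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # the frozen random scaffold
        # Trainable low-rank factors; B starts at zero so the adapter
        # contributes nothing at initialization.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")  # ~3.0% at rank 8
```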
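The simulation entry centers on modeling access latency for a globally shared L1 SPM. As a toy illustration of the kind of bank-contention modeling such a simulator performs, here is a hypothetical cycle-approximate model that serializes concurrent accesses to the same SPM bank; the bank count, latencies, and trace format are invented for the example and are not the paper's model.

```python
from collections import defaultdict

BANKS = 8          # number of SPM banks (illustrative)
HIT_LATENCY = 2    # cycles for an uncontended SPM access (illustrative)

def simulate_spm_trace(trace):
    """Cycle-approximate model of a shared, banked L1 SPM.

    trace: list of (issue_cycle, core_id, address) requests.
    Each bank can start one access per cycle, so requests to the same
    bank are serialized. Returns per-request completion cycles and a
    simple per-bank stall profile of the sort used for optimization.
    """
    bank_free_at = defaultdict(int)  # cycle at which each bank is next free
    stalls = defaultdict(int)
    completions = []
    for issue, core, addr in sorted(trace):
        bank = addr % BANKS
        start = max(issue, bank_free_at[bank])
        stalls[bank] += start - issue   # profiling: contention stall cycles
        bank_free_at[bank] = start + 1  # bank accepts one new access per cycle
        completions.append((core, start + HIT_LATENCY))
    return completions, dict(stalls)

# Two cores hammering one bank versus accesses spread across banks.
trace = [(0, 0, 0), (0, 1, 8), (1, 0, 16), (1, 1, 3)]
done, profile = simulate_spm_trace(trace)
print(done)     # addresses 0, 8, 16 all map to bank 0 and serialize
print(profile)  # stall cycles per bank: the data a profiler would expose
```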
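Of the co-design techniques in the focus-session entry, mixed-precision quantization is the easiest to show in a few lines. The sketch below uses symmetric per-tensor int8 quantization and a hypothetical mean-squared-error budget to decide which layers may drop to int8; the thresholding rule and layer statistics are assumptions for illustration, not a method from the paper.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization, one standard scheme."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def mixed_precision_plan(layers, mse_budget=1e-7):
    """Keep a layer in int8 only if its round-trip quantization error
    stays under the budget; otherwise leave it in fp16. The budget and
    the per-layer rule are illustrative, not a published recipe."""
    plan = {}
    for name, w in layers.items():
        q, scale = quantize_int8(w)
        mse = np.mean((w - q.astype(np.float32) * scale) ** 2)
        plan[name] = "int8" if mse < mse_budget else "fp16"
    return plan

rng = np.random.default_rng(0)
layers = {
    "attn.qkv": rng.normal(0, 0.02, (64, 64)).astype(np.float32),  # narrow range
    "mlp.up":   rng.normal(0, 0.20, (64, 64)).astype(np.float32),  # wide range
}
print(mixed_precision_plan(layers))  # the wide-range layer falls back to fp16
```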