LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
hub Canonical reference
LLMLingua: Com- pressing prompts for accelerated inference of large language models
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
Input compression increases net cost and reduces accuracy as models respond longer; output compression reduces cost on most models, with surface text diverging from unconstrained references on non-reasoning models.
Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.
FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
LLM-based compression of financial source material can alter downstream investment decisions via decontextualization and model dependency, addressed by an agentic auditing approach that checks multiple compressions against the original.
Empirical study demonstrates that cost-aware skill rewriting for LLM agents can achieve 7% total cost reduction and 6% agent-token cost reduction with preserved quality on SkillsBench.
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
Hierarchical prompt-domain control framework separates schema distillation from online semantic adaptation in agentic LLMs using an oracle loop, evaluated on a Multi-Fidelity Bayesian Optimization testbed.
citing papers explorer
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
LightThinker++: From Reasoning Compression to Memory Management
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
-
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
-
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.