hub Canonical reference

LLMLingua: Com- pressing prompts for accelerated inference of large language models

Jiang, Huiqiang, Wu, Qianhui, Lin, Chin-Yew, Yang, Yuqing, Qiu, Lili · 2023 · DOI 10.18653/v1/2023.emnlp-main.825

Canonical reference. 80% of citing Pith papers cite this work as background.

17 Pith papers citing it

Background 80% of classified citations

open at publisher browse 17 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 4 baseline 1

citation-polarity summary

background 4 baseline 1

representative citing papers

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

cs.IR · 2026-04-15 · unverdicted · novelty 7.0

A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

RouterBench: A Benchmark for Multi-LLM Routing System

cs.LG · 2024-03-18 · unverdicted · novelty 7.0

RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

Input compression increases net cost and reduces accuracy as models respond longer; output compression reduces cost on most models, with surface text diverging from unconstrained references on non-reasoning models.

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.

Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

cs.DC · 2026-04-14 · unverdicted · novelty 6.0

Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.

FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

cs.AI · 2026-04-12 · unverdicted · novelty 6.0

FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.

LightThinker++: From Reasoning Compression to Memory Management

cs.CL · 2026-04-04 · unverdicted · novelty 6.0

LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

cs.IR · 2026-04-03 · conditional · novelty 6.0

LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.

What Should a Skill Remember? Quality--Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

cs.CL · 2026-06-08 · unverdicted · novelty 5.0

Empirical study demonstrates that cost-aware skill rewriting for LLM agents can achieve 7% total cost reduction and 6% agent-token cost reduction with preserved quality on SkillsBench.

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.

citing papers explorer

Showing 17 of 17 citing papers.

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents cs.CL · 2026-06-04 · unverdicted · none · ref 34
LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection cs.CR · 2026-05-12 · unverdicted · none · ref 39
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments cs.SE · 2026-05-04 · unverdicted · none · ref 10
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation cs.IR · 2026-04-15 · unverdicted · none · ref 17
A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 65
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory cs.CL · 2024-10-14 · unverdicted · none · ref 73
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
RouterBench: A Benchmark for Multi-LLM Routing System cs.LG · 2024-03-18 · unverdicted · none · ref 23
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression cs.CL · 2026-06-23 · unverdicted · none · ref 1
Input compression increases net cost and reduces accuracy as models respond longer; output compression reduces cost on most models, with surface text diverging from unconstrained references on non-reasoning models.
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning cs.LG · 2026-06-08 · unverdicted · none · ref 16
Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts cs.CL · 2026-06-03 · unverdicted · none · ref 235
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads cs.DC · 2026-04-14 · unverdicted · none · ref 11
Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.
FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning cs.AI · 2026-04-12 · unverdicted · none · ref 16
FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
LightThinker++: From Reasoning Compression to Memory Management cs.CL · 2026-04-04 · unverdicted · none · ref 62
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference cs.IR · 2026-04-03 · conditional · none · ref 10
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
What Should a Skill Remember? Quality--Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents cs.CL · 2026-06-08 · unverdicted · none · ref 9
Empirical study demonstrates that cost-aware skill rewriting for LLM agents can achieve 7% total cost reduction and 6% agent-token cost reduction with preserved quality on SkillsBench.
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression cs.CL · 2026-05-09 · unverdicted · none · ref 50 · 2 links
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text cs.CV · 2026-05-07 · unverdicted · none · ref 15
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.

LLMLingua: Com- pressing prompts for accelerated inference of large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer