Recognition: no theorem link
Large Language Model as Token Compressor and Decompressor
Pith reviewed 2026-05-15 00:51 UTC · model grok-4.3
The pith
An off-the-shelf LLM can be fine-tuned with LoRA to compress long texts into variable-length sequences of Z-tokens while preserving reconstruction quality and task performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fine-tuning a pretrained LLM with LoRA adapters on a self-expressive autoencoding objective, long texts map to compact sequences of learned latent codes termed Z-tokens; these codes decode back to natural language or task outputs, reduce effective context length in a content-adaptive way, and support both direct decoding from compressed states and autoregressive generation inside the Z-token space.
What carries the argument
The self-expressive autoencoding framework that trains the LLM via LoRA to produce and decode variable-length Z-tokens according to an information-density budget.
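For orientation, here is a minimal sketch of how such an objective could be assembled; the `compress`/`decompress` methods, the loss composition, and the hinge-style budget penalty are assumptions made for illustration, not the paper's actual implementation.

```python
import torch.nn.functional as F

def self_expressive_autoencoding_loss(model, tokenizer, text, budget, lam=0.1):
    """Hypothetical training step: map `text` to Z-tokens, reconstruct it from them,
    and penalize Z-token counts beyond a length budget. `compress` and `decompress`
    are assumed LoRA-adapted heads, not the paper's real API."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Encode: a variable-length sequence of latent codes for this input.
    z_tokens = model.compress(input_ids)          # (1, n_z, d_model)

    # Decode: reconstruct the original tokens conditioned only on the Z-tokens.
    logits = model.decompress(z_tokens, target_len=input_ids.size(1))  # (1, seq_len, vocab)
    recon_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), input_ids.view(-1))

    # Budget-aware length regularizer: pay only for Z-tokens beyond the budget,
    # so information-dense inputs can still claim more codes.
    length_penalty = lam * max(0, z_tokens.size(1) - budget)

    return recon_loss + length_penalty
```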
If this is right
- Effective context length shrinks while reconstruction quality and task accuracy stay intact.
- Generation-stage memory consumption and overall latency drop for long inputs.
- Direct decoding becomes possible straight from the compressed Z-token sequence.
- Autoregressive generation can run inside the Z-token space itself.
Where Pith is reading between the lines
- The variable-length scheme may allow stacking multiple compression stages for extremely long documents.
- Z-tokens could serve as a drop-in interface for retrieval-augmented pipelines that need to handle lengthy sources.
- The same autoencoding objective might extend to compressing structured data such as code repositories or dialogue histories.
Load-bearing premise
Fine-tuning with LoRA on the self-expressive autoencoding objective produces Z-tokens that preserve enough information for faithful reconstruction and downstream task performance without extensive post-hoc adjustments.
What would settle it
Measure whether Z-token-compressed inputs on HotpotQA or QuALITY yield substantially lower accuracy than the original full-length texts; a consistent, large gap would show that the latent codes lose critical information.
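One way such a check could be scripted, sketched under the assumption of hypothetical `answer`, `compress_context`, and `answer_from_z` methods on the fine-tuned model; the example field names follow a generic QA layout rather than any exact dataset schema.

```python
def accuracy_gap(model, qa_examples):
    """Hypothetical A/B check: answer each question from the full context and from
    its Z-token compression, then compare exact-match accuracy. A large, consistent
    gap would indicate the latent codes drop information the task needs."""
    full_correct = compressed_correct = 0
    for ex in qa_examples:  # each ex assumed to hold "context", "question", "answer"
        full_pred = model.answer(ex["context"], ex["question"])
        z_tokens = model.compress_context(ex["context"])
        comp_pred = model.answer_from_z(z_tokens, ex["question"])
        full_correct += int(full_pred.strip() == ex["answer"].strip())
        compressed_correct += int(comp_pred.strip() == ex["answer"].strip())
    n = len(qa_examples)
    return full_correct / n, compressed_correct / n
```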
read the original abstract
In this paper, we study whether an off-the-shelf LLM can be adapted into a discrete, variable-length token compressor and decompressor for long-context processing. To this end, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM with lightweight LoRA adapters to map long texts into compact sequences of learned latent codes, termed Z-tokens, and to decode them back into natural language or task outputs. The resulting representation is content-adaptive: less predictable or information-dense segments can receive more Z-tokens, while redundant regions can be represented more compactly through a budget-aware length regularizer. Our method is evaluated on long-context datasets such as Wikipedia, CNN/DailyMail, HotpotQA, and QuALITY, showing that it preserves reconstruction quality and downstream performance while reducing effective context length, generation-stage memory usage, and end-to-end latency. This simple design supports both direct decoding from compressed contexts and autoregressive generation in the Z-token space, providing a practical interface for efficient long-context inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes adapting an off-the-shelf LLM into a discrete, variable-length token compressor and decompressor via a self-expressive autoencoding framework. It fine-tunes the model with LoRA adapters to map long input texts to compact sequences of learned latent codes (Z-tokens) and decode them back to natural language or task outputs. A budget-aware length regularizer makes the representation content-adaptive, allocating more Z-tokens to information-dense segments. The approach is evaluated on long-context datasets including Wikipedia, CNN/DailyMail, HotpotQA, and QuALITY, with claims that it preserves reconstruction quality and downstream task performance while reducing effective context length, memory usage, and latency. It also supports direct decoding from compressed contexts and autoregressive generation in Z-token space.
Significance. If the central claims hold, the work would offer a lightweight, practical interface for efficient long-context inference that leverages existing pretrained LLMs without major architectural redesign. The variable-length, content-adaptive compression could meaningfully reduce generation-stage memory and end-to-end latency on tasks requiring long contexts, while the dual support for reconstruction and direct Z-space generation provides flexibility. The simplicity of the LoRA-based self-expressive objective is a strength if it generalizes without extensive post-hoc tuning.
major comments (3)
- [Framework and Evaluation] The self-expressive autoencoding objective (described in the framework section) relies on reconstruction loss that aligns with next-token or embedding-level statistics; this does not automatically guarantee retention of reasoning chains or multi-hop facts required for downstream performance on HotpotQA and QuALITY. Without an ablation isolating reconstruction metrics from task accuracy on the same splits, the preservation claim for information-dense segments remains unverified and load-bearing for the central thesis.
- [Method] The budget-aware length regularizer is presented as enabling content-adaptive allocation, yet no sensitivity analysis or comparison to fixed-length baselines is reported. If the regularizer compresses high-entropy regions too aggressively, downstream QA performance can degrade even when aggregate reconstruction looks acceptable; this interaction is central to the variable-length advantage and requires explicit quantification (a sketch of one possible budget sweep follows this list).
- [Experiments] The abstract and evaluation description list datasets and high-level outcomes but supply no quantitative tables, training hyperparameters (e.g., LoRA rank, learning rate), or full-context baseline numbers. This absence prevents assessment of effect sizes and makes it impossible to confirm that Z-token compression actually outperforms standard long-context handling on the reported metrics.
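To make the requested quantification concrete, here is a minimal sketch of one possible budget sweep, assuming the compressor exposes its budget at inference time through a hypothetical `budget` argument; none of these names come from the paper.

```python
def budget_sensitivity_sweep(model, qa_examples, budgets=(16, 32, 64, 128, 256)):
    """Hypothetical sweep: vary the Z-token budget and record QA exact-match accuracy,
    so over-compression of high-entropy regions shows up as an accuracy drop."""
    accuracy_by_budget = {}
    for b in budgets:
        correct = 0
        for ex in qa_examples:
            z_tokens = model.compress_context(ex["context"], budget=b)
            pred = model.answer_from_z(z_tokens, ex["question"])
            correct += int(pred.strip() == ex["answer"].strip())
        accuracy_by_budget[b] = correct / len(qa_examples)
    return accuracy_by_budget  # budget -> accuracy, to set against a fixed-length baseline
```

Plotting accuracy against the budget, alongside a fixed-length baseline at the same average compression rate, would quantify the claimed variable-length advantage.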
minor comments (2)
- [Introduction] Notation for Z-tokens is introduced without a formal definition or dimensionality specification; a brief equation or diagram would clarify how they differ from standard token embeddings.
- [Method] The claim that the method 'supports both direct decoding from compressed contexts and autoregressive generation in the Z-token space' would benefit from a short illustrative example or pseudocode to show the interface.
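For illustration of what that interface might look like, here is a sketch in which `compress`, `predict_next_z`, and the use of `inputs_embeds` with `generate` are assumed placeholders rather than the paper's API.

```python
import torch

def z_token_interface_sketch(model, long_context_ids, prompt_embeds):
    """Illustrative only; method names and tensor shapes are assumptions."""
    # 1) Compress the long context once into a short sequence of Z-token embeddings.
    z_tokens = model.compress(long_context_ids)            # (1, n_z, d_model), n_z << seq_len

    # 2) Direct decoding from the compressed context: condition generation on the
    #    Z-tokens plus the task prompt and emit ordinary output tokens.
    answer_ids = model.generate(
        inputs_embeds=torch.cat([z_tokens, prompt_embeds], dim=1),
        max_new_tokens=64,
    )

    # 3) Autoregressive generation in Z-token space: extend the compressed
    #    representation itself, decoding back to text only when needed.
    next_z = model.predict_next_z(z_tokens)                # (1, 1, d_model)
    z_tokens = torch.cat([z_tokens, next_z], dim=1)

    return answer_ids, z_tokens
```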
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work adapting LLMs into content-adaptive token compressors via LoRA fine-tuning. We address each major comment below and will revise the manuscript to strengthen the presentation of results and ablations.
read point-by-point responses
-
Referee: [Framework and Evaluation] The self-expressive autoencoding objective (described in the framework section) relies on reconstruction loss that aligns with next-token or embedding-level statistics; this does not automatically guarantee retention of reasoning chains or multi-hop facts required for downstream performance on HotpotQA and QuALITY. Without an ablation isolating reconstruction metrics from task accuracy on the same splits, the preservation claim for information-dense segments remains unverified and load-bearing for the central thesis.
Authors: We agree that reconstruction loss does not by itself guarantee retention of reasoning chains. Our evaluations on HotpotQA and QuALITY directly measure end-to-end task accuracy after compression, which serves as a proxy for fact retention. To make this explicit, we will add an ablation that reports both reconstruction metrics (e.g., perplexity, BLEU) and downstream accuracy on identical data splits, isolating the contribution of the self-expressive objective. revision: yes
-
Referee: [Method] The budget-aware length regularizer is presented as enabling content-adaptive allocation, yet no sensitivity analysis or comparison to fixed-length baselines is reported. If the regularizer compresses high-entropy regions too aggressively, downstream QA performance can degrade even when aggregate reconstruction looks acceptable; this interaction is central to the variable-length advantage and requires explicit quantification.
Authors: We acknowledge that the interaction between the regularizer and high-entropy segments needs explicit quantification. In the revision we will add sensitivity analysis across different budget values, report performance curves for the regularizer, and include direct comparisons against fixed-length Z-token baselines on the same QA tasks to demonstrate the variable-length benefit. revision: yes
-
Referee: [Experiments] The abstract and evaluation description list datasets and high-level outcomes but supply no quantitative tables, training hyperparameters (e.g., LoRA rank, learning rate), or full-context baseline numbers. This absence prevents assessment of effect sizes and makes it impossible to confirm that Z-token compression actually outperforms standard long-context handling on the reported metrics.
Authors: The full manuscript contains quantitative tables in the experiments section, but we agree the abstract and high-level description lack specific numbers. We will expand the abstract with key effect sizes, add a dedicated hyperparameters table (LoRA rank, learning rate, etc.), and include explicit full-context baseline comparisons for memory, latency, and accuracy to allow direct assessment of improvements. revision: yes
Circularity Check
No significant circularity; empirical fine-tuning method is self-contained
full rationale
The paper proposes a practical adaptation of off-the-shelf LLMs via LoRA fine-tuning on a self-expressive autoencoding objective to produce variable-length Z-tokens for compression and decompression. All central claims rest on empirical evaluations of reconstruction quality and downstream task performance (e.g., HotpotQA, QuALITY) rather than any derivation that reduces predictions to fitted inputs by construction. No self-definitional loops, no fitted parameters renamed as predictions, no load-bearing self-citations, and no uniqueness theorems imported from prior author work. The method is a standard supervised training pipeline whose outputs are measured against external benchmarks, making the derivation chain independent of its own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank and scaling
- length budget parameter
axioms (1)
- domain assumption: A pretrained LLM can be fine-tuned to map arbitrary text segments into a compact sequence of learned latent codes while remaining usable for generation and downstream tasks
invented entities (1)
- Z-tokens (no independent evidence)
Forward citations
Cited by 4 Pith papers
- OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction. OmniTrend predicts popularity by combining separate content attractiveness and contextual exposure predictors using cross-modal and exogenous signals.
- HotComment: A Benchmark for Evaluating Popularity of Online Comments. HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...
- Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction. A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.
- CurEvo: Curriculum-Guided Self-Evolution for Video Understanding. CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
Reference graph
Works this paper leans on
- [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
- [2] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster, 2023.
- [3] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. On the opportunities and risks of foundation models, 2022.
- [4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, et al. Language models are few-shot learners, 2020.
- [5] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts, 2023.
- [6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.
- [7] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models, 2024.
- [8] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers, 2021.
- [9] Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. Language modeling is compression, 2024.
- [10] Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. MaskLLM: Learnable semi-structured sparsity for large language models, 2024.
- [11] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2022.
- [12] Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model, 2024.
- [13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021.
- [14] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression, 2024.
- [15] Shuning Jin, Sam Wiseman, Karl Stratos, and Karen Livescu. Discrete latent variable representations for low-resource text classification, 2020.
- [16] Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, and Kurt Keutzer. Learned token pruning for transformers, 2022.
- [17] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020.
- [18] Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge, 2017.
- [19] Eugene Kwek and Wenpeng Yin. Compact: Common-token optimized model pruning across channels and tokens, 2025.
- [20] Yucheng Li, Bo Dong, Chenghua Lin, and Frank Guerin. Compressing context to enhance inference efficiency of large language models, 2023.
- [21] Zongqian Li, Yixuan Su, and Nigel Collier. 500xCompressor: Generalized prompt compression for large language models.
- [22] Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, and James Glass. Cross-modal discrete representation learning, 2021.
- [23] Qijiong Liu, Hengchang Hu, Jiahao Wu, Jieming Zhu, Min-Yen Kan, and Xiao-Ming Wu. Discrete semantic tokenization for deep CTR prediction, 2024.
- [24] Xin Liu, Jie Liu, Jie Tang, and Gangshan Wu. CATANet: Efficient content-aware token aggregation for lightweight image super-resolution, 2025.
- [25] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.
- [26] Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens, 2024.
- [27]
- [28] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling, 2019.
- [29] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification, 2021.
- [30] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers, 2020.
- [31] Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context, 2022.
- [32] Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, and Jinwoo Shin. Hierarchical context merging: Better long context understanding for pre-trained LLMs, 2024.
- [33] Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, and Raluca Ada Popa. LLoCO: Learning long contexts offline, 2024.
- [34] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023.
- [35] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2017.
- [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
- [37] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022.
- [38] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, et al., 2025.
- [39] Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, and Beidi Chen. S2FT: Efficient, scalable and generalizable LLM fine-tuning by structured sparsity, 2024.
- [40] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering, 2018.
- [41] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention, 2025.
- [42] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences, 2021.
- [43] Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Long context compression with activation beacon, 2024.
- [44] Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. Unsupervised discrete sentence representation learning for interpretable neural dialog generation, 2018.
- [45] Yiwu Zhong, Zhuoming Liu, Yin Li, and Liwei Wang. AIM: Adaptive inference of multi-modal LLMs via token merging and pruning, 2025.
- [46] Łukasz Kaiser and Samy Bengio. Discrete autoencoders for sequence models, 2018.
- [47] Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables, 2018.