
arxiv: 2606.18056 · v1 · pith:TZHK2LPLnew · submitted 2026-06-16 · 💻 cs.CL

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

Pith reviewed 2026-06-27 00:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords hybrid attention · controllable sparsity · learnable allocation · sliding window attention · full attention · L0 regularization · LLM efficiency · attention patterns

The pith

Learning which attention units use full versus sliding-window attention outperforms hand-crafted allocation rules at the same sparsity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ConSA to learn which attention units in hybrid LLMs use full attention or sliding-window attention while meeting a user-specified sparsity target. It trains binary masks with L0 regularization and enforces the exact sparsity budget through an augmented Lagrangian at either layer or KV-head granularity. On 0.6B and 1.7B models the resulting allocations beat rule-based baselines, consistently placing sliding-window attention in bottom layers and concentrating full attention into contiguous middle-layer blocks. A sympathetic reader would care because the work shows that attention-type choices can be discovered rather than guessed, offering a path to more efficient inference without manual pattern engineering.

Core claim

ConSA employs L0 regularization to learn binary masks selecting between full attention and sliding-window attention for each attention unit, while an augmented Lagrangian constraint enforces a user-specified sparsity target at either layer or KV-head granularity. The learned allocations place sliding-window attention in the bottom layers and concentrate full attention into contiguous middle-layer blocks. These patterns diverge from evenly interleaved designs in rule-based methods, persist across model scales and sparsity levels, and deliver higher downstream performance than the baselines.
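
To make the mask-learning step concrete, here is a minimal sketch of a hard-concrete gate, the relaxation commonly used for L0 regularization. It is an illustration under assumptions, not ConSA's implementation: the function names, the constants `GAMMA`, `ZETA` and `BETA`, the example logits, and the convention that z = 1 selects full attention while z = 0 selects sliding-window attention are all hypothetical choices made for this example.

```python
import math
import random

# Stretch interval and temperature. These are defaults commonly used with
# hard-concrete gates; ConSA's actual values are not given here (assumption).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Sample a relaxed gate z in [0, 1] for one attention unit."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))  # clip, so exact 0 and 1 occur


def prob_open(log_alpha):
    """P(z != 0): differentiable surrogate for the L0 norm of one gate."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def binarize(log_alpha):
    """Deterministic 0/1 decision (assumed here: 1 = FA, 0 = SWA)."""
    stretched = sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA
    return 1 if stretched > 0.5 else 0


rng = random.Random(0)
log_alphas = [-3.0, -0.5, 0.5, 3.0]  # one hypothetical logit per unit
gates = [sample_gate(a, rng) for a in log_alphas]
hard = [binarize(a) for a in log_alphas]
expected_sparsity = 1.0 - sum(prob_open(a) for a in log_alphas) / len(log_alphas)
```

During training the sampled `gates` would scale or mix the two attention paths, while `expected_sparsity` is the differentiable quantity a sparsity constraint can act on; `binarize` corresponds to fixing the assignment afterwards.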

What carries the argument

L0-regularized binary mask learning with augmented Lagrangian sparsity constraint applied at layer-wise or KV-head-wise granularity to select full attention versus sliding-window attention.
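
One way to read the sparsity constraint is as a Lagrangian penalty with a linear and a quadratic term, a form often used in structured pruning with L0 masks. Whether ConSA uses exactly this expression is an assumption; the page only states that Lagrange multipliers {λ, ϕ} are optimized jointly with the mask parameters. The names `constraint_loss` and `ascent_step`, the step size, and the example numbers are hypothetical.

```python
def constraint_loss(expected_sparsity, target, lam, phi):
    """Penalty lam * gap + phi * gap^2, zero when the target is met."""
    gap = expected_sparsity - target
    return lam * gap + phi * gap * gap


def ascent_step(expected_sparsity, target, lam, phi, lr=0.1):
    """Gradient *ascent* on the multipliers (the model does descent).

    d(loss)/d(lam) = gap and d(loss)/d(phi) = gap^2, so a persistent
    violation keeps increasing the pressure on the masks.
    """
    gap = expected_sparsity - target
    return lam + lr * gap, phi + lr * gap * gap


lam, phi = 0.0, 0.0
for _ in range(3):  # three steps with sparsity stuck above the target
    lam, phi = ascent_step(0.7, 0.5, lam, phi)
penalty = constraint_loss(0.7, 0.5, lam, phi)
```

The min-max structure is the point of the design: the model and mask parameters minimize the penalty while the multipliers maximize it, so the only stable outcome is an expected sparsity equal to the target.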

If this is right

  • KV-head-wise allocation produces clearer gains than layer-wise allocation at the same sparsity.
  • The learned pattern of bottom-layer sliding-window attention and contiguous middle-layer full attention blocks holds across the tested scales and sparsity targets.
  • Learned allocations avoid the evenly interleaved patterns typical of rule-based methods.
  • The approach achieves the sparsity target while maintaining performance without requiring extra recovery training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the contiguous middle-layer full-attention blocks turn out to be broadly optimal, the same block structure could be hard-coded into larger models without any learning step.
  • The method could be extended to control other attention variants such as local-global or sparse attention patterns beyond sliding-window attention.
  • Layer depth appears to correlate with preferred attention type in a non-uniform way that uniform interleaving overlooks.
  • The same controllable-sparsity training could be tested on models larger than 1.7B to check whether the discovered pattern continues to hold.

Load-bearing premise

L0 regularization together with the augmented Lagrangian constraint can produce stable binary masks that preserve downstream performance at the target sparsity without post-training recovery steps.

What would settle it

Either of two outcomes would undercut the claim: models retrained or evaluated with the learned masks show no accuracy gain over rule-based allocations at the same sparsity, or the pattern of sliding-window attention (SWA) in the bottom layers and a contiguous full-attention (FA) block in the middle layers fails to appear at a different model scale or architecture.
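
The pattern half of this test can be made mechanical. The sketch below is a hypothetical check (the helper names and the choice of how many layers count as "bottom" are assumptions, not taken from the paper): given a per-layer assignment string with 'F' for full attention and 'S' for sliding-window attention, it reports the sparsity, whether the lowest layers are all SWA, and the longest contiguous FA run. The two 28-layer patterns are made up for illustration; they are not the paper's learned masks.

```python
def sparsity(pattern):
    """Fraction of layers assigned to sliding-window attention ('S')."""
    return pattern.count("S") / len(pattern)


def bottom_is_swa(pattern, n_bottom):
    """True if the first n_bottom layers all use SWA."""
    return all(c == "S" for c in pattern[:n_bottom])


def longest_fa_run(pattern):
    """Length of the longest contiguous block of full-attention layers."""
    best = run = 0
    for c in pattern:
        run = run + 1 if c == "F" else 0
        best = max(best, run)
    return best


# Illustrative 28-layer patterns at sparsity 0.5 (not the learned masks).
interleaved = "SF" * 14
blocked = "S" * 8 + "F" * 14 + "S" * 6

for name, p in [("interleaved", interleaved), ("blocked", blocked)]:
    print(name, sparsity(p), bottom_is_swa(p, 4), longest_fa_run(p))
# interleaved 0.5 False 1
# blocked 0.5 True 14
```

Both patterns have identical sparsity, so any difference in downstream accuracy between them would be attributable to where the FA layers sit rather than how many there are.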

Figures

Figures reproduced from arXiv: 2606.18056 by Dianhai Yu, Junyuan Shang, Shuohuan Wang, Simeng Zhang, Tingwen Liu, Xiangzhao Hao, Yao Chen, Yilong Chen, Yinqi Yang.

Figure 1. Overview of ConSA. Left: the two-stage training pipeline. Stage 1 jointly optimizes the model parameters θ, mask parameters α, and Lagrange multipliers {λ, ϕ} on 1B tokens, with the constraint ρ̂(z) = ρ enforcing the user-specified target sparsity. Stage 2 binarizes the masks and continues pre-training for 100B tokens with a fixed FA/SWA assignment. Right: the per-head allocation mechanism. For each KV hea…
Figure 2. Convergence of the Lagrangian constraint loss…
Figure 4. Learned layer-wise FA/SWA allocation at ρ = 0.50. Each cell indicates whether a layer uses FA (red) or SWA (blue).
[Plot residue from an adjacent figure: x-axis "Layer Index" (0 to 26), y-axis "SWA Head Ratio" (0.0 to 1.0); series: 0.6B at ρ = 0.50, 0.25, 0.75 and 1.7B at ρ = 0.50.]
Figure 3. Training loss trajectories on 0.6B under…
Figure 6. Last-token attention distribution across representative layers of the 0.6B model, spanning the spectrum…
Figure 7. Learned scalar gates α_i of the ablation variant across 28 layers. Nearly all layers converge to the same value, showing minimal differentiation.
… of 3 × 10⁻⁴, a minimum learning rate of 3 × 10⁻⁵, and 300 warmup steps. The global batch size is 128 with a maximum sequence length of 8,192. Training runs for 1,000 steps, corresponding to approximately 1B tokens. For Stage 2 continued pre-training, we use the…
Figure 8. Lagrangian constraint loss on 0.6B at ρ ∈ {0.25, 0.50, 0.75}. The ρ = 0.25 and ρ = 0.75 settings use layer-wise and head-wise (all-layers) allocation; ρ = 0.50 additionally includes head-wise (single-layer). Dashed vertical lines mark the approximate convergence step for the slowest configuration in each panel.
Rule-based (Jiang et al., 2023; Team, 2024) · LoZA (Zhang et al., 2025) · ConSA · Allocation method · Ha…
Figure 9. Learned layer-wise FA/SWA allocation on 0.6B at ρ ∈ {0.25, 0.50, 0.75}. Each cell indicates whether a layer uses FA (red) or SWA (blue).
… DuoAttention (Xiao et al., 2025). Since LoZA does not specify its calibration objective in detail and the full DuoAttention setup involves a distillation loss against a dense teacher together with synthetic retrieval data, we adopt a simplified variant for a controlled co…
Figure 10. Trajectory of expected sparsity E[ρ̂(z)] during Stage 1 mask learning on 0.6B at ρ ∈ {0.25, 0.50, 0.75}. Dashed horizontal lines indicate the target ρ. All configurations initially overshoot to a similar level before settling to their respective targets, with higher ρ requiring less correction and thus converging earlier.
    Model   Dense FA        ConSA (ρ = 0.50)
    0.6B    17.42 × 10¹⁵    14.65 × 10¹⁵ (↓ 15.9%)
    1.7B    52.57 …
Figure 11. Learned head-wise FA/SWA allocation across model scales and sparsity levels. Each cell indicates…
Figure 12. Cross-task last-token attention distribution for L1 of the 0.6B model.
Figure 13. Cross-task last-token attention distribution for L4 of the 0.6B model.
Figure 14. Cross-task last-token attention distribution for L16 of the 0.6B model.
Figure 15. Cross-task last-token attention distribution for L22 of the 0.6B model.
Figure 16. Head-wise last-token attention distribution for L9 of the 0.6B model (assigned to FA across three…
Figure 17. Head-wise last-token attention distribution for L27 of the 0.6B model (assigned to SWA across three…
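
Two numbers quoted in the caption fragments above can be sanity-checked with plain arithmetic: the Stage 1 token budget implied by the batch size, sequence length and step count (Figure 7 fragment), and the relative reduction between the 0.6B Dense FA and ConSA figures (Figure 10 fragment). The variable names are illustrative, the token count assumes every sequence is filled to the maximum length, and the unit of the cost figures is not stated on this page.

```python
# Stage 1 token budget: global batch x max sequence length x steps.
batch_size = 128
max_seq_len = 8192
steps = 1000
stage1_tokens = batch_size * max_seq_len * steps
print(stage1_tokens)  # 1048576000, i.e. roughly 1B tokens

# Relative reduction for the 0.6B row: dense FA vs ConSA at rho = 0.50.
dense_cost = 17.42e15
consa_cost = 14.65e15
reduction_pct = 100.0 * (dense_cost - consa_cost) / dense_cost
print(round(reduction_pct, 1))  # 15.9
```

Both values agree with the text: about 1B tokens for Stage 1 and a 15.9% reduction for the 0.6B model.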
Original abstract

Hybrid architectures combining full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods typically rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns optimal FA/SWA assignment under a user-specified sparsity target. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. Learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from evenly interleaved patterns in rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors that underlies the learned allocation.
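
For readers less familiar with the two attention types being allocated, a small sketch of their causal masks follows. It uses one common definition of sliding-window attention, in which each query sees itself and the `window - 1` tokens before it; the paper's window size and exact convention are not stated on this page, so `window=3` and the helper names `causal_mask` and `attended_pairs` are assumptions.

```python
def causal_mask(seq_len, window=None):
    """Boolean mask[i][j]: may query position i attend to key position j?

    window=None gives full causal attention (FA); an integer gives
    sliding-window attention (SWA) over the last `window` positions.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: no looking ahead
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


fa = causal_mask(6)
swa = causal_mask(6, window=3)
print(attended_pairs(fa), attended_pairs(swa))  # 21 15
```

The number of allowed pairs grows quadratically with sequence length under FA but only linearly under SWA once the sequence exceeds the window, which is where the savings of hybrid designs come from.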

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ConSA, a framework for learning optimal FA/SWA allocations in hybrid attention LLMs under a user-specified sparsity target. It employs L0 regularization to produce binary masks over attention units and an augmented Lagrangian multiplier to enforce the sparsity constraint at either layer-wise or KV-head-wise granularity. Evaluations are reported on 0.6B and 1.7B scale models, claiming that the learned allocations outperform rule-based baselines, that KV-head-wise allocation yields larger gains than layer-wise, and that the resulting patterns (SWA concentrated in bottom layers, FA in contiguous middle-layer blocks) are consistent across scales, sparsity levels, and granularities.

Significance. If the empirical results hold and the regularization reliably produces exact binary masks at the stated sparsity without post-training recovery, the work would supply a data-driven alternative to hand-crafted hybrid attention designs and could surface reproducible layer-wise attention preferences that generalize across model scales.

major comments (2)
  1. [Abstract] Abstract and evaluation sections: the central claim that learned allocations outperform rule-based baselines at user-specified sparsity requires evidence that the L0 term plus Lagrangian multiplier drives the allocation variables to exact binary values while preserving downstream performance; the provided text supplies no metrics on achieved vs. target sparsity, mask entropy at convergence, or post-mask recovery steps, leaving the attribution of gains to the allocation itself unverified.
  2. [Method] Method description of the augmented Lagrangian constraint: without reported analysis of constraint violation or binarity (e.g., fraction of mask entries strictly 0/1 at convergence), it is unclear whether the procedure meets the load-bearing assumption that stable binary masks are obtained at the target sparsity.
minor comments (1)
  1. [Abstract] The abstract refers to 'two LLMs at the 0.6B and 1.7B scales' but does not name the base models or datasets; adding these details would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for explicit verification of the binarity and sparsity enforcement mechanisms in ConSA. We agree these details strengthen the central claims and will incorporate the requested analyses in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation sections: the central claim that learned allocations outperform rule-based baselines at user-specified sparsity requires evidence that the L0 term plus Lagrangian multiplier drives the allocation variables to exact binary values while preserving downstream performance; the provided text supplies no metrics on achieved vs. target sparsity, mask entropy at convergence, or post-mask recovery steps, leaving the attribution of gains to the allocation itself unverified.

    Authors: We acknowledge that the manuscript does not report quantitative metrics confirming that the L0 regularization and augmented Lagrangian produce exact binary masks at the target sparsity. The current presentation relies on the design of the objective to achieve this outcome. In revision we will add tables reporting achieved vs. target sparsity, mask entropy at convergence, and the fraction of entries that are exactly 0/1 without any post-training recovery, allowing direct attribution of performance gains to the learned allocations. revision: yes

  2. Referee: [Method] Method description of the augmented Lagrangian constraint: without reported analysis of constraint violation or binarity (e.g., fraction of mask entries strictly 0/1 at convergence), it is unclear whether the procedure meets the load-bearing assumption that stable binary masks are obtained at the target sparsity.

    Authors: The augmented Lagrangian term is introduced precisely to enforce the user-specified sparsity while the L0 penalty encourages binarity. We agree that empirical confirmation of constraint satisfaction and mask binarity is necessary to validate the assumption. In the revision we will include analysis of constraint violation (e.g., mean absolute deviation from target sparsity) and the percentage of mask entries that converge to exactly 0 or 1, reported across layers, granularities, and model scales. revision: yes
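
The diagnostics promised in the rebuttal are simple to compute once gate values are available. A minimal sketch, assuming each unit has a gate value in [0, 1] with 0 meaning SWA and 1 meaning FA, and treating a non-binary gate as a Bernoulli probability for the entropy (that convention, the function names, and the sample values are hypothetical, not results from the paper):

```python
import math


def achieved_sparsity(gates):
    """Fraction of units that would be SWA after thresholding at 0.5."""
    return sum(1 for g in gates if g < 0.5) / len(gates)


def binary_fraction(gates):
    """Fraction of gates that are exactly 0 or exactly 1."""
    return sum(1 for g in gates if g in (0.0, 1.0)) / len(gates)


def mean_entropy(gates):
    """Mean Bernoulli entropy in bits; 0 means every gate is decided."""
    total = 0.0
    for g in gates:
        if 0.0 < g < 1.0:
            total -= g * math.log2(g) + (1.0 - g) * math.log2(1.0 - g)
    return total / len(gates)


gates = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.5, 0.0]  # made-up example values
target = 0.5
print(achieved_sparsity(gates))                # 0.5
print(abs(achieved_sparsity(gates) - target))  # 0.0
print(binary_fraction(gates))                  # 0.875
print(mean_entropy(gates))                     # 0.125
```

Reporting these three quantities per layer, granularity and model scale would address both referee comments directly.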

Circularity Check

0 steps flagged

No circularity: independent training procedure with external benchmarks

Full rationale

The paper presents ConSA as a standard optimization framework that applies L0 regularization plus augmented Lagrangian to enforce user-specified sparsity on FA/SWA masks during training. The central claim (learned allocations outperform rule-based baselines) is evaluated by direct comparison on downstream LLM performance at fixed model scales, which constitutes an external benchmark rather than a quantity defined inside the method. No equations, derivations, or self-citations are shown that reduce the reported gains to a fitted parameter or to a self-referential definition. The method is self-contained against external validation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available; the method implicitly assumes standard properties of L0 relaxation and augmented Lagrangian optimization.

axioms (2)
  • domain assumption: L0 regularization can be approximated to produce binary FA/SWA masks
    Core mechanism stated in abstract without further justification.
  • domain assumption: Augmented Lagrangian can enforce exact sparsity targets at layer or KV-head level
    Constraint mechanism stated without derivation or reference.

pith-pipeline@v0.9.1-grok · 5760 in / 1157 out tokens · 36095 ms · 2026-06-27T00:30:07.667178+00:00 · methodology


Reference graph

Works this paper leans on

36 extracted references · 21 canonical work pages · 6 internal anchors

  1. [1]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L. … 2023. doi:10.48550/ARXIV.2310.06825. arXiv:2310.06825.

  2. [2]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. CoRR, 2024. doi:10.48550/ARXIV.2408.00118. arXiv:2408.00118.

  3. [3]

    Chen Zhang, Yang Bai, Jiahuan Li, Anchun Gui, Keheng Wang, Feifan Liu, Guanyu Wu, Yuwei Jiang, Defei Bu, Li Wei, Haihang Jing, Hongyin Tang, Xin Chen, Xiangzhou Huang, Fengcun Li, Rongxiang Weng, Yulei Qian, Yifan Lu, Yerui Sun, Jingang Wang, Yuchen Xie, Xunliang Cai. CoRR ...

  4. [4]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov. 2019. doi:10.18653/V1/P19-1580.

  5. [5]

    What Does BERT Look at? An Analysis of BERT's Attention

    Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. Proceedings of the 2019 … 2019. doi:10.18653/V1/W19-4828.

  6. [6]

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen. The Twelfth International Conference on Learning Representations, 2024.

  7. [7]

    Iz Beltagy, Matthew E. Peters, Arman Cohan. CoRR, 2020. arXiv:2004.05150.

  8. [8]

    Big Bird: Transformers for Longer Sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Onta… 2020.

  9. [9]

    Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. CoRR, 2019. arXiv:1904.10509.

  10. [10]

    Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, Fran… 2020.

  11. [11]

    Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya. 8th International Conference on Learning Representations, 2020.

  12. [13]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu, Tri Dao. CoRR, 2023. doi:10.48550/ARXIV.2312.00752. arXiv:2312.00752.

  13. [14]

    Barak Lenz, Opher Lieber, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden M. Gerber, Elad Dolev, Eran Krakovsky, Erez Safahi, Erez Schwartz, Gal Cohen, et al. Th...

  14. [15]

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han. The Thirteenth International Conference on Learning Representations, 2025.

  15. [16]

    Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad, Kangling Liu, Yang Liu. CoRR, 2023. doi:10.48550/ARXIV.2311.03426. arXiv:2311.03426.

  16. [17]

    MiMo-V2-Flash Technical Report

    LLM… 2026. doi:10.48550/ARXIV.2601.02780. arXiv:2601.02780.

  17. [18]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. The Thirty-Fourth … 2020. doi:10.1609/AAAI.V34I05.6239.

  18. [19]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. 9th International Conference on Learning Representations, 2021.

  19. [20]

    Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, Yue Zhang. LogiQA: … Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020. doi:10.24963/IJCAI.2020/501.

  20. [21]

    CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi:10.18653/V1/N19-1421.

  21. [22]

    Social IQa: Commonsense Reasoning about Social Interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, Yejin Choi. 2019. doi:10.18653/V1/D19-1454.

  22. [23]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. CoRR, 2018. arXiv:1803.05457.

  23. [24]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019. doi:10.18653/V1/P19-1472.

  24. [25]

    Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering

    2016.

  25. [26]

    Yudong Li, Yuqing Zhang, Zhe Zhao, Linlin Shen, Weijie Liu, Weiquan Mao, Hui Zhang. Proceedings of the 29th International Conference on Computational Linguistics, 2022.

  26. [28]

    Learning Sparse Neural Networks through L_0 Regularization

    International Conference on Learning Representations.

  27. [29]

    Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

    Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang. CoRR, 2026. doi:10.48550/ARXIV.2603.26380. arXiv:2603.26380.

  28. [30]

    Improving Reasoning Capabilities in Small Models through Mixture-of-layers Distillation with Stepwise Attention on Key Information

    Yao Chen, Jiawei Sheng, Wenyuan Zhang, Tingwen Liu. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. doi:10.18653/v1/2025.emnlp-main.250.

  29. [31]

    Maurice Weber, Daniel Y. Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher R… RedPajama: an Open Dataset for Training Large Language Models ...

  30. [32]

    Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei. CoRR, 2025. doi:10.48550/ARXIV.2501.08197. arXiv:2501.08197.

  31. [33]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li. LongBench: … Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. doi:10.18653/V...

  32. [35]

    Efficient Attentions for Long Document Summarization

    Luyang Huang, Shuyang Cao, Nikolaus Nova Parulian, Heng Ji, Lu Wang. 2021. doi:10.18653/V1/2021.NAACL-MAIN.112.

  33. [36]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning. HotpotQA: … Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. doi:10.18653/V1/D18-1259.

  34. [37]

    Musique: Multihop questions via single-hop question composition. Trans…

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang. Trans. Assoc. Comput. Linguistics, 2024. doi:10.1162/TACL_A_00638.

  35. [38]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica. 2023. doi:10.1145/3600006.3613165.

  36. [39]

    InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

    Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun. 2024.