arxiv: 2509.24552 · v3 · submitted 2025-09-29 · 💻 cs.LG · cs.AI

Short window attention enables long-term memorization

Loïc Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazaré, Gabriel Synnaeve, Hervé Jégou


Pith reviewed 2026-05-18 12:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords sliding window attention · xLSTM · hybrid architectures · long-context modeling · stochastic window size · long-term memory · local-global attention


The pith

Short sliding windows strengthen long-term memory in hybrid attention-xLSTM models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies hybrid architectures that combine local sliding-window attention layers with global xLSTM linear RNN layers. It finds that smaller window sizes improve long-context performance because the model can no longer rely on attention for distant retrieval and must instead strengthen its xLSTM memory. The same pattern appears when alternating short-window and full-attention layers: the short layers must stay small to keep the full layers useful. Overly small fixed windows hurt short-context tasks, so the authors train models with randomly varying window sizes, which improves results on both short and long sequences.
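To make the point about distant retrieval concrete, here is a minimal sketch of how far back a stack of causal sliding-window layers can reach through attention alone. It assumes each query attends to itself and the previous `window - 1` tokens; the function name and the layer count are illustrative assumptions, not values from the paper.

```python
def max_lookback(num_layers: int, window: int) -> int:
    """Upper bound on how many tokens back attention alone can carry information.

    Assumes a causal window in which each position sees itself and the
    previous `window - 1` positions, so each layer adds at most
    `window - 1` tokens of reach.
    """
    if num_layers < 0 or window < 1:
        raise ValueError("num_layers must be >= 0 and window >= 1")
    return num_layers * (window - 1)


# Hypothetical configuration, chosen only for illustration.
for w in (128, 2048):
    print(f"window={w}: at most {max_lookback(num_layers=8, window=w)} tokens back")
```

This is only a theoretical ceiling. Anything farther back than this bound cannot come from the windowed attention layers and has to be carried by the recurrent layers.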

Core claim

In the SWAX hybrid of sliding-window attention and xLSTM layers, larger sliding windows reduce long-context performance while shorter windows improve it by forcing the xLSTM to handle long-term retrieval that local attention can no longer cover. The same holds for local-global attention stacks, where short layers must remain small. Training with stochastic window sizes lets the model use both short-term local information and long-term memory, outperforming fixed-window baselines on short- and long-context problems.

What carries the argument

The sliding-window attention mechanism whose length controls how much retrieval is offloaded to the xLSTM linear RNN layers.
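As a rough illustration of the mechanism, the sketch below builds a causal sliding-window mask in plain Python. The function name and the convention that the window counts the current token are assumptions made for this example; the paper's implementation may differ.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query i may attend to key j.

    Causal: a query never sees future keys (j > i).
    Windowed: a query sees itself and at most `window - 1` earlier keys.
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window >= 1")
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Print a small mask: '#' marks allowed positions.
for row in sliding_window_mask(seq_len=6, window=3):
    print("".join("#" if allowed else "." for allowed in row))
```

Shrinking `window` removes the older keys from every row, which is the lever the paper varies; when `window >= seq_len` the mask reduces to ordinary causal attention.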

If this is right

Shorter fixed windows improve long-context results by increasing dependence on xLSTM memory.
In alternating local-global attention stacks, keeping short layers small preserves the value of full attention layers.
Stochastic variation of window size during training yields gains on both short- and long-context tasks over any fixed window.
Excessively small fixed windows degrade short-context performance that moderate windows could handle.
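The stochastic-window idea in the list above can be sketched as a per-step sampler. Everything specific here is an assumption for illustration: the two-point choice between a short and a long window, the values 128 and 2048, the probability `p_short`, and the optional `anneal_frac` after which the window is fixed. The abstract does not specify the sampling distribution or the resampling frequency.

```python
import random


def sample_window(
    rng: random.Random,
    step: int,
    total_steps: int,
    short: int = 128,        # assumed short window
    long: int = 2048,        # assumed long window
    p_short: float = 0.5,    # assumed probability of the short window
    anneal_frac: float = 1.0,  # assumed fraction of training with sampling on
) -> int:
    """Choose the attention window for one training step.

    While step < anneal_frac * total_steps, return `short` with
    probability `p_short` and `long` otherwise. After that point,
    always return `long`.
    """
    if not 0.0 <= p_short <= 1.0:
        raise ValueError("p_short must be in [0, 1]")
    if step >= anneal_frac * total_steps:
        return long
    return short if rng.random() < p_short else long


rng = random.Random(0)  # fixed seed so runs are repeatable
schedule = [sample_window(rng, s, 1000, anneal_frac=0.9) for s in range(1000)]
print("short-window steps:", schedule.count(128))
```

Resampling once per step is one possible design; resampling per sequence or per layer would be equally plausible readings and would need their own experiments.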

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architects of long-context systems may improve global memory by deliberately restricting local context windows.
The same short-window pressure could be tested on other recurrent or stateful modules paired with attention.
Measuring internal long-range retrieval accuracy before and after short-window training would directly test the claimed mechanism.
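One way to approach the retrieval measurement suggested above is to split retrieval results by whether the target lies inside or outside the attention window. The sketch below is a hypothetical evaluation helper: the record format `(distance, correct)` and the bucket names are assumptions, and it uses the single-layer window as a crude cut-off even though stacked layers can reach somewhat farther.

```python
from typing import Dict, Iterable, Tuple


def accuracy_by_reach(
    records: Iterable[Tuple[int, bool]], window: int
) -> Dict[str, float]:
    """Retrieval accuracy for targets inside vs. outside the window.

    Each record is (distance, correct), where `distance` is the number
    of tokens between the stored fact and the query position.
    Buckets with no records are omitted from the result.
    """
    hits = {"within_window": 0, "beyond_window": 0}
    totals = {"within_window": 0, "beyond_window": 0}
    for distance, correct in records:
        bucket = "within_window" if distance < window else "beyond_window"
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {name: hits[name] / totals[name] for name in totals if totals[name] > 0}


# Made-up records, for illustration only.
demo = [(10, True), (50, True), (500, False), (900, True)]
print(accuracy_by_reach(demo, window=128))
```

If the memory-forcing account holds, short-window training should mainly raise the `beyond_window` number, since that is the range attention cannot cover.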

Load-bearing premise

Gains from shorter windows come from forcing greater use of xLSTM long-term memory rather than from incidental changes in gradients or regularization.

What would settle it

Train the same short-window model but add an auxiliary long-range retrieval path that bypasses the xLSTM; if long-context gains disappear, the memory-forcing account is supported.

Original abstract

Recent works show that hybrid architectures combining local sliding window attention layers and global attention layers outperform either of these architectures taken separately. However, the impact of the window length and the interplay between local layers and global layers remain under-studied. In this work, we first analyze the interaction between short- and long-term memory by considering SWAX: a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding is that larger sliding windows hurt the long-context performance. In fact, short window attention encourages the model to better train the long-term memory of the xLSTM as it cannot rely on the local softmax attention mechanism for long-context retrieval. We also validate our findings on local-global architectures alternating short window and full attention layers: the short layers should be small in order not to hinder the usefulness of the long layers. However, employing too small sliding windows is detrimental even for short-context tasks, which could be solved with information from moderately larger sliding windows otherwise. Therefore, we train hybrid architectures by stochastically changing the sliding window size, forcing the model to leverage both the short-term window and the long-term memory. Training with stochastic window sizes significantly outperforms regular window attention both on short and long-context problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SWAX, a hybrid architecture combining sliding-window attention layers with xLSTM linear RNN layers. It reports the counter-intuitive result that larger sliding windows degrade long-context performance, attributing this to short windows forcing greater reliance on and better training of the xLSTM long-term memory. The finding is extended to alternating short-window and full-attention hybrids, and a stochastic window-size training procedure is proposed that improves results on both short- and long-context tasks.

Significance. If the central mechanism holds, the work supplies a lightweight, parameter-free training intervention (stochastic window sizing) that could improve long-term memorization in hybrid attention-RNN models without increasing compute. The result would be practically useful for scaling context length in resource-constrained settings and would motivate further study of how local attention interacts with recurrent memory.

major comments (3)

[Abstract and experimental validation sections] The interpretation that short windows improve long-context performance specifically by compelling the xLSTM to learn better long-term memory (rather than through incidental effects on gradient flow, regularization, or effective capacity) is load-bearing for the central claim yet unsupported by isolating experiments. No memory-state ablations, gradient-norm measurements, or matched-capacity controls are described that would separate the proposed mechanism from these confounds.
[Validation on local-global architectures] The statement that 'short layers should be small in order not to hinder the usefulness of the long layers' is presented as a general guideline, but the manuscript provides no quantitative analysis of the interaction (e.g., performance curves versus window size for fixed long-layer capacity) or statistical significance of the reported gains.
[Training with stochastic window sizes] The stochastic window-size training method is claimed to outperform regular window attention on both short- and long-context problems, but the abstract and description lack details on the distribution from which window sizes are sampled, the frequency of resampling, and whether the improvement survives when total training compute is matched.
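For the statistical-significance concern raised in the comments above, a paired test over random seeds is one possible check. The sketch below is an exact two-sided sign-flip permutation test on per-seed score differences; the choice of test and the function name are assumptions of this example, not something the paper or the referee prescribes.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip test for paired per-seed scores.

    Under the null hypothesis that the two settings are exchangeable,
    each paired difference is equally likely to have either sign.
    The p-value is the share of sign assignments whose absolute summed
    difference is at least as large as the observed one.
    Enumerates 2**n assignments, so it suits small seed counts only.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total
```

With only five seeds the smallest attainable two-sided p-value is 2/32, so a handful of runs cannot establish significance at the 0.05 level with this test.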

minor comments (2)

[Introduction / Architecture description] Notation for the hybrid layer ordering and the precise definition of 'short' versus 'long' context lengths should be clarified with a diagram or explicit equations early in the manuscript.
[Experimental setup] The manuscript would benefit from an explicit statement of the baseline models and hyper-parameter search protocol used for all reported comparisons.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that provide additional experimental support and details without altering the core findings.


Referee: [Abstract and experimental validation sections] The interpretation that short windows improve long-context performance specifically by compelling the xLSTM to learn better long-term memory (rather than through incidental effects on gradient flow, regularization, or effective capacity) is load-bearing for the central claim yet unsupported by isolating experiments. No memory-state ablations, gradient-norm measurements, or matched-capacity controls are described that would separate the proposed mechanism from these confounds.

Authors: We agree that isolating the proposed mechanism from potential confounds such as gradient flow or regularization effects would strengthen the central claim. Our existing results demonstrate that shorter windows consistently yield better long-context performance in the hybrid SWAX architecture, which we interpret as evidence of increased reliance on xLSTM long-term memory. However, we acknowledge the value of direct ablations. In the revised manuscript we will add memory-state analyses (e.g., inspecting or intervening on xLSTM hidden states), gradient-norm comparisons across window sizes, and matched-capacity controls that adjust for effective model capacity or regularization strength.
Revision: yes
Referee: [Validation on local-global architectures] The statement that 'short layers should be small in order not to hinder the usefulness of the long layers' is presented as a general guideline, but the manuscript provides no quantitative analysis of the interaction (e.g., performance curves versus window size for fixed long-layer capacity) or statistical significance of the reported gains.

Authors: We appreciate this feedback on the need for more rigorous quantification. The guideline is drawn from our experiments showing that larger short-window layers can diminish the contribution of the full-attention layers in alternating local-global setups. To address the concern, the revision will include performance curves of task metrics versus short-window size under fixed long-layer capacity, along with statistical significance testing (multiple random seeds and appropriate hypothesis tests) for the reported improvements.
Revision: yes
Referee: [Training with stochastic window sizes] The stochastic window-size training method is claimed to outperform regular window attention on both short- and long-context problems, but the abstract and description lack details on the distribution from which window sizes are sampled, the frequency of resampling, and whether the improvement survives when total training compute is matched.

Authors: We concur that these methodological details are essential for reproducibility and for confirming that gains are not artifacts of unequal compute. The revised manuscript will specify the exact sampling distribution (e.g., uniform over a defined range of window sizes), the resampling frequency (e.g., per batch or per epoch), and will include controlled experiments that match total training compute (by equating FLOPs or step counts) to verify that the stochastic-window approach retains its advantages on both short- and long-context benchmarks.
Revision: yes
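The compute-matching question in the last response can be approximated by counting attended (query, key) pairs. The sketch below is a rough proxy under stated assumptions: it covers only the windowed attention layers, ignores the xLSTM and feed-forward cost, and the window distribution passed in is hypothetical.

```python
from typing import Dict


def attended_pairs(seq_len: int, window: int) -> int:
    """Count (query, key) pairs under a causal sliding window.

    The first `w` queries see 1, 2, ..., w keys; every later query
    sees exactly `w` keys, where w = min(window, seq_len).
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window >= 1")
    w = min(window, seq_len)
    return w * (w + 1) // 2 + (seq_len - w) * w


def expected_pairs(seq_len: int, window_probs: Dict[int, float]) -> float:
    """Expected pair count when the window is sampled from `window_probs`."""
    if abs(sum(window_probs.values()) - 1.0) > 1e-9:
        raise ValueError("window probabilities must sum to 1")
    return sum(p * attended_pairs(seq_len, w) for w, p in window_probs.items())


# Hypothetical comparison: fixed long window vs. a 50/50 short/long mix.
T = 16384
print("fixed 2048:", attended_pairs(T, 2048))
print("mix 128/2048:", expected_pairs(T, {128: 0.5, 2048: 0.5}))
```

Under this proxy a mix of short and long windows costs less attention compute per step than the long window alone, so a step-matched comparison would, if anything, favor the fixed-window baseline on compute.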

Circularity Check

0 steps flagged

No circularity: purely empirical claims without derivation chain


The paper reports experimental results comparing hybrid sliding-window attention and xLSTM architectures across different window sizes and stochastic training regimes. No equations, first-principles derivations, or fitted parameters are presented that reduce any claimed prediction to its own inputs by construction. Central findings (e.g., short windows improving long-context performance) rest on direct benchmark measurements rather than self-referential definitions or self-citation load-bearing steps. The work is self-contained against external replication and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that observed gains arise from memory specialization rather than confounding factors, plus standard assumptions about attention and RNN layers functioning as described in prior literature.

axioms (1)

domain assumption: Hybrid local-global attention architectures can be stably trained and compared when window size is varied.
Invoked when claiming that short windows improve long-term memory training without destabilizing optimization.

pith-pipeline@v0.9.0 · 5788 in / 1174 out tokens · 32831 ms · 2026-05-18T12:07:23.023850+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "short window attention encourages the model to better train the long-term memory of the xLSTM as it cannot rely on the local softmax attention mechanism for long context-retrieval"

IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we train hybrid architectures by stochastically changing the sliding window size"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
