DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

Junpeng Jiang; Liqiang Nie; Miao Zhang; Weili Guan; Weizhe Chen; Yaping Li

arXiv: 2605.20936 · v1 · pith:43MUQF2Znew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.CL


Pith reviewed 2026-05-21 06:15 UTC · model grok-4.3


keywords: hybrid attention · differentiable architecture search · LLM efficiency · NAS · attention mechanisms · inference optimization · long-context modeling


The pith

DASH shows that differentiable search can discover high-performing hybrid attention architectures for LLMs in about 20 minutes on a single GPU using only 12.3 million tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that relaxing discrete layer-wise choices of attention operators into continuous logits allows efficient architecture search while keeping model weights frozen. This approach prepares reusable teacher-aligned candidates and finds hybrid designs that outperform existing selector-based methods on models like Qwen2.5-3B-Instruct. A sympathetic reader would care because it turns hybrid architecture design from an expensive, token-heavy process into a routine, minutes-level task that still yields stronger long-context performance than previous automated searches. The results indicate that high-quality designs emerge directly from this lightweight optimization without needing billions of tokens for search.

Core claim

DASH converts discrete per-layer operator placement into continuous architecture logits and, after preparing linear candidates aligned to a teacher, optimizes only those logits with model and operator weights frozen; the optimized logits are then discretized into a hybrid attention architecture. On Qwen2.5-3B-Instruct, the resulting architectures consistently beat selector-style baselines and achieve better RULER scores than released Jet-Nemotron models. Each search run uses 12.3M tokens and about 20 minutes on one GPU, which is 0.006% of the 200B tokens used by Jet-Nemotron's PostNAS search stages.

What carries the argument

The relaxation of discrete layer-wise attention operator placement into continuous architecture logits, optimized with model and operator weights frozen.
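A minimal sketch of what such a relaxation can look like, assuming a DARTS-style softmax over per-layer logits (this page does not spell out the exact mixing rule DASH uses). The operator names FULL and LINEAR come from the Figure 2 caption; the helper names `softmax`, `mixed_output`, and `discretize`, and all numbers, are hypothetical.

```python
import math

OPS = ("FULL", "LINEAR")  # binary search space named in the Figure 2 caption


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(logits, candidate_outputs):
    """Relaxed layer output: softmax-weighted sum of the candidate outputs.

    Assumption: the candidates are frozen, so only `logits` would be trained.
    """
    probs = softmax(logits)
    dim = len(candidate_outputs[0])
    return [
        sum(p * out[i] for p, out in zip(probs, candidate_outputs))
        for i in range(dim)
    ]


def discretize(all_logits):
    """Pick the highest-logit (highest-probability) operator in each layer."""
    return [OPS[max(range(len(l)), key=l.__getitem__)] for l in all_logits]


# Toy two-layer example; logits and output vectors are made-up values.
arch_logits = [[2.0, -1.0], [-0.5, 1.5]]
layer0_candidates = [[1.0, 0.0], [0.0, 1.0]]  # FULL and LINEAR outputs for one token

print(mixed_output(arch_logits[0], layer0_candidates))
print(discretize(arch_logits))  # ['FULL', 'LINEAR']
```

In an actual search the candidate outputs would come from frozen attention operators and only the logits would receive gradients; here they are fixed toy vectors so the mixing and discretization steps can be read in isolation.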

If this is right

Hybrid attention architectures can be designed through direct differentiable optimization rather than manual rules or proxy selectors.
Search for such architectures becomes feasible as a routine step requiring minimal compute and data.
Discovered designs preserve competitiveness on short-context and general tasks while improving on long-context benchmarks.
Architecture quality found under frozen weights transfers to end-to-end evaluation and training scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the search consistently produces strong architectures across different base models, it could standardize hybrid attention as a default efficiency technique.
This efficiency might allow exploring hybrid designs at larger scales where full searches were previously prohibitive.
Extending the candidate preparation to other operator types could broaden the applicability beyond attention hybrids.

Load-bearing premise

That the architectures discovered by searching only the continuous logits with frozen weights will retain their performance advantages once the full model including those operators is trained or fine-tuned end-to-end.

What would settle it

A direct comparison where a DASH-discovered architecture is trained from scratch alongside a strong baseline architecture and shows no improvement or worse results on RULER and other benchmarks.

Figures

Figures reproduced from arXiv: 2605.20936 by Junpeng Jiang, Liqiang Nie, Miao Zhang, Weili Guan, Weizhe Chen, Yaping Li.

**Figure 2.** Effect of the budget coefficient λ. Bars show the realized budget, and lines show the final RULER score. Lower λ weakens the cost penalty, leading to larger realized budgets and higher RULER scores in both the binary FULL/LINEAR and tri-state FULL/WINDOW/LINEAR search spaces. Final RULER scores in this diagnostic sweep are measured after 300M-token Stage 3 distillation.

**Figure 3.** Inference efficiency at batch size 16. DASH@ …

**Figure 4.** Final discrete operator allocations across layers. Each row corresponds to one evaluated …

**Figure 5.** Binary DASH top-1 operator allocations across realized budgets. Each row corresponds to …

**Figure 6.** Tri-state DASH top-1 operator allocations across realized budgets. Each row corresponds …

**Figure 7.** Layer-wise routing distributions for a representative low-budget tri-state pair with and …

Original abstract

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DASH makes hybrid attention search cheap enough for routine use by freezing weights and relaxing choices to logits, but the frozen proxy needs direct checks against end-to-end results.


Dear colleague,

The main point is that DASH finds hybrid attention architectures for models like Qwen2.5-3B in about 20 minutes on one GPU using only 12.3M tokens, and the resulting designs beat selector baselines plus released Jet-Nemotron models on RULER while staying competitive on short-context tasks. The approach relaxes discrete layer choices into continuous logits, prepares teacher-aligned linear candidates, and searches architecture only while keeping model and operator weights frozen. This cuts the token budget to a tiny fraction of Jet-Nemotron's PostNAS stage and shows that direct differentiable search can outperform manual or proxy-selector rules for this problem. The efficiency gain is real and practical.

The soft spot is the missing validation that architectures chosen under frozen weights keep their ranking or absolute quality once the full model trains or evaluates end-to-end. If the proxy objective diverges from the adaptive regime, the reported gains rest on an untested assumption. The abstract gives no explicit correlation check or ablation on this transfer, which is the load-bearing claim. Minor points include the absence of error bars on the comparisons and limited description of how the search procedure itself was stabilized.

This work is aimed at people building efficient LLM inference pipelines who want a low-cost way to explore hybrid attention without 200B-token searches. A reader who needs quick architecture candidates for balancing quality and speed would get concrete value. I would send it to peer review. The compute reduction is substantial enough to justify referee time, provided the proxy-to-final transfer gets more direct evidence.

Referee Report

2 major / 2 minor

Summary. The paper introduces DASH, a differentiable architecture search method for hybrid attention in LLMs. It relaxes discrete layer-wise attention operator choices into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen. On Qwen2.5-3B-Instruct, the discovered architectures outperform selector-style baselines, achieve stronger RULER performance than released Jet-Nemotron models, and remain competitive on short-context and general benchmarks, with each search using only 12.3M tokens and ~20 minutes on a single RTX Pro 6000 GPU (0.006% of Jet-Nemotron PostNAS tokens).

Significance. If the central claims hold, DASH would make automated hybrid attention design practical and routine by reducing search cost by four orders of magnitude relative to prior NAS pipelines. This could accelerate iteration on efficient LLM inference architectures and shift the field from manual rules or expensive selector proxies toward direct differentiable search. The work provides concrete evidence that high-quality hybrids are discoverable under a frozen-weight proxy, which, if the proxy is reliable, represents a substantial efficiency gain.
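The token-budget comparison can be checked with a few lines of arithmetic. The two token counts are the ones quoted in the abstract; the variable names are illustrative only.

```python
import math

search_tokens = 12.3e6    # DASH tokens per search run (from the abstract)
postnas_tokens = 200e9    # Jet-Nemotron PostNAS search tokens (from the abstract)

fraction = search_tokens / postnas_tokens      # share of the PostNAS budget
reduction = postnas_tokens / search_tokens     # how many times fewer tokens
orders = math.floor(math.log10(reduction))     # whole orders of magnitude

print(f"fraction of PostNAS tokens: {fraction * 100:.3f}%")
print(f"reduction factor: {reduction:,.0f}x ({orders} orders of magnitude)")
```

The ratio works out to roughly 0.006% of the PostNAS tokens, a reduction of about 16,000-fold, which is consistent with the "four orders of magnitude" wording.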

major comments (2)

[Method section, architecture search procedure] The central performance claims rest on the assumption that architectures selected via differentiable search with frozen model and operator weights will preserve their quality ranking and absolute performance once the full model is trained or evaluated end-to-end. The manuscript provides no explicit validation experiment (e.g., an ablation that measures proxy scores versus final RULER or benchmark scores across multiple candidate architectures after unfreezing and retraining), which is load-bearing for interpreting the reported gains over Jet-Nemotron and selector baselines.
[Experimental results, Qwen2.5-3B-Instruct evaluation] While the abstract and results claim consistent outperformance and stronger RULER scores, the support would be strengthened by reporting error bars or multiple random seeds for the discovered architectures, as the search procedure itself involves stochastic elements in the continuous relaxation and discretization step.
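One way the proxy-versus-final check requested in the first major comment could be quantified is a rank correlation between search-time scores and post-training scores. The sketch below uses Kendall's tau as an assumed choice of statistic; the report does not prescribe one, and the numbers are placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder values for four hypothetical candidate architectures.
proxy_scores = [0.10, 0.25, 0.40, 0.55]   # e.g. negated search loss under frozen weights
final_scores = [61.0, 64.5, 63.0, 70.2]   # e.g. RULER after end-to-end training

print(kendall_tau(proxy_scores, final_scores))
```

A value near 1 would indicate that the frozen-weight proxy ranks architectures the same way end-to-end evaluation does; a value near 0 would support the referee's concern.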

minor comments (2)

[Figure 1, search pipeline diagram] The distinction between frozen components and the architecture logits could be labeled more explicitly to clarify what remains fixed during the 20-minute search.
[Related work] The positioning against Jet-Nemotron would benefit from a brief quantitative comparison table of search token counts and wall-clock time in addition to the narrative description.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of validating the frozen-weight proxy and quantifying stochasticity in the search procedure. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.


Referee: [Method section, architecture search procedure] The central performance claims rest on the assumption that architectures selected via differentiable search with frozen model and operator weights will preserve their quality ranking and absolute performance once the full model is trained or evaluated end-to-end. The manuscript provides no explicit validation experiment (e.g., an ablation that measures proxy scores versus final RULER or benchmark scores across multiple candidate architectures after unfreezing and retraining), which is load-bearing for interpreting the reported gains over Jet-Nemotron and selector baselines.

Authors: We agree that an explicit proxy-to-final validation would further support the claims. DASH is explicitly designed around a frozen-weight, architecture-only search using teacher-aligned linear candidates to enable minutes-scale discovery. The architectures found under this proxy are then directly instantiated and evaluated end-to-end on RULER and other benchmarks, where they outperform both selector baselines and released Jet-Nemotron models. This provides empirical evidence that the proxy produces high-quality hybrids in practice. To address the concern directly, we will add a new ablation in the revised manuscript that reports proxy scores alongside final benchmark performance for a set of candidate architectures sampled during search.

Revision: yes

Referee: [Experimental results, Qwen2.5-3B-Instruct evaluation] While the abstract and results claim consistent outperformance and stronger RULER scores, the support would be strengthened by reporting error bars or multiple random seeds for the discovered architectures, as the search procedure itself involves stochastic elements in the continuous relaxation and discretization step.

Authors: The continuous relaxation is optimized via gradient descent on architecture logits, and the final discretization selects the highest-probability operator per layer. While the optimization trajectory can vary with initialization, we acknowledge that reporting variability would improve robustness. In the revised version we will run the full DASH search three times with independent random seeds on the same base model, report the resulting architectures and their RULER scores with standard deviation, and include these statistics in the main results table.

Revision: yes
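The three-seed reporting promised in the second response could be summarized along these lines. This is a sketch under stated assumptions: the helper names `summarize` and `agreement` are hypothetical, the layer-agreement measure is an added illustration rather than something the rebuttal commits to, and every number is a placeholder.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def agreement(arch_a, arch_b):
    """Fraction of layers where two discrete architectures pick the same operator."""
    if len(arch_a) != len(arch_b):
        raise ValueError("architectures must have the same number of layers")
    return sum(a == b for a, b in zip(arch_a, arch_b)) / len(arch_a)


# Placeholder RULER scores for three independent seeds (not from the paper).
ruler_by_seed = [70.0, 71.0, 72.0]
mean, std = summarize(ruler_by_seed)
print(f"RULER: {mean:.1f} ± {std:.1f}")  # RULER: 71.0 ± 1.0

# Placeholder four-layer architectures from two seeds.
seed_a = ["FULL", "LINEAR", "LINEAR", "FULL"]
seed_b = ["FULL", "LINEAR", "FULL", "FULL"]
print(agreement(seed_a, seed_b))  # 0.75
```

Reporting layer agreement alongside the score spread would show whether different seeds converge to the same architecture or to different architectures of similar quality.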

Circularity Check

0 steps flagged

No circularity: claims rest on external benchmark comparisons


The paper introduces DASH as a differentiable search procedure that relaxes discrete operator choices to continuous logits, prepares teacher-aligned linear candidates, and runs architecture search with weights frozen. Reported gains are obtained by direct empirical comparison against selector-style baselines and released Jet-Nemotron models on RULER, short-context, and general benchmarks. No equation or self-citation chain is shown that reduces the discovered architectures or their performance numbers to quantities defined by construction inside the paper. The central result therefore remains an independent empirical finding rather than a tautological restatement of the search procedure itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard differentiable NAS assumption that continuous relaxation of discrete layer choices yields useful architectures when weights are held fixed; no new free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Relaxing discrete layer-wise attention operator placement into continuous architecture logits preserves the quality of the eventual discrete solution.
This is the core relaxation step that enables gradient-based search; it is invoked when the paper states that discrete placement is relaxed into continuous logits.

pith-pipeline@v0.9.0 · 5814 in / 1380 out tokens · 36957 ms · 2026-05-21T06:15:11.635618+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: DASH relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $L_{\text{search}} = L_{\text{KL}} + \lambda L_{\text{cost}}$ with $c = (1, w/T, 0)$ for FULL/WINDOW/LINEAR
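The search objective quoted in the second passage combines a KL term with a λ-weighted cost term, using per-operator costs of 1, w/T, and 0 for FULL, WINDOW, and LINEAR. A minimal sketch of how such a loss could be evaluated follows. Several details are assumptions not stated on this page: that the cost is the softmax-weighted cost averaged over layers, that the KL term is KL(teacher || student) over output distributions, and that w is the window size and T the sequence length. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_cost(arch_logits, window, seq_len):
    """Assumed cost term: softmax-weighted cost c = (1, w/T, 0), averaged over layers."""
    costs = (1.0, window / seq_len, 0.0)  # FULL, WINDOW, LINEAR
    per_layer = [
        sum(p * c for p, c in zip(softmax(layer), costs))
        for layer in arch_logits
    ]
    return sum(per_layer) / len(per_layer)


def search_loss(teacher_probs, student_probs, arch_logits, lam, window, seq_len):
    """L_search = L_KL + lambda * L_cost under the assumptions listed above."""
    kl = kl_divergence(teacher_probs, student_probs)
    return kl + lam * expected_cost(arch_logits, window, seq_len)


# Toy check: identical teacher and student distributions, one layer, uniform logits.
loss = search_loss([0.5, 0.5], [0.5, 0.5], [[0.0, 0.0, 0.0]],
                   lam=0.5, window=512, seq_len=4096)
print(loss)
```

With the KL term at zero, the toy loss reduces to λ times the average cost, which makes the role of λ visible: a smaller λ penalizes FULL and WINDOW layers less, matching the trend described in the Figure 2 caption.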

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
