pith. machine review for the scientific record.

arxiv: 2604.22575 · v1 · submitted 2026-04-24 · 💻 cs.LG

Recognition: unknown

SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

Anjie Hu, Bohan Sun, Bo Xu, Fangzhi Zhong, Guoqi Li, Han Xu, Jinghao Zhuang, Kun Yang, Lingtao Ouyang, Shaowei Gu, Shurong Wang, Siyu Ding, Xuerui Qiu, Yibo Zhong, Yuhong Chou, Yupeng Feng, Yuqi Pan, Zehao Liu, Zhiyong Qin

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords spiking models · long-context inference · sparse attention · transformer conversion · neuromorphic computing · efficient quantization · multimodal models · foundation models

The pith

A hybrid sparse attention and spiking quantization design lets a 5B model recover most of its base transformer's performance while delivering roughly 10x faster long-context inference and neuromorphic hardware gains, all with under 7k GPU hours of conversion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that brain-inspired mechanisms can be embedded into foundation models through a hybrid attention architecture and a lightweight conversion process to overcome the memory and compute barriers of full-attention transformers at multi-million token lengths. It shows this is possible by combining two forms of sparse attention across layers, adding dual quantization routes for different hardware, and using an optimized pipeline to adapt existing models with curated data and minimal extra training. A reader would care because this combination keeps capability close to the original while unlocking inference speeds and platform flexibility that standard models cannot achieve at scale. The work therefore points toward foundation models that remain practical for very long contexts and edge or specialized hardware without starting from scratch.

Core claim

SpB2.0 is a 5B-parameter model built on Dual-Space Sparse Attention, an inter-layer mix of sparse softmax attention and sparse linear attention, together with dual quantization paths that support both INT8 event-driven spiking computation and FP8 GPU acceleration. An optimized Transformer-to-Hybrid conversion pipeline applied to a Qwen3-4B base model, using only curated open-source data and under 7k A100 GPU hours, recovers most of the original capability for both language and vision-language variants. The resulting model achieves a 10.13 times speedup in time-to-first-token at 4 million context length and supports more than 10 million tokens on eight A100 GPUs under vLLM, where full-attention models exceed memory limits. It also runs across platforms, with a 2.52 times FP8 speedup at 250k context on GPUs and 64.31 percent spiking sparsity that yields 70.6 percent area and 46.5 percent power reduction on neuromorphic hardware at 500 MHz.
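
To make the dual quantization claim concrete, here is a minimal sketch of what the two activation-coding paths could look like in PyTorch: integer spike counts for event-driven accumulation and an FP8 cast for GPU tensor cores. The scale choice, the non-negative gating that produces sparsity, and the fallback dtype are illustrative assumptions, not the paper's released implementation.

```python
# Minimal, illustrative sketch of "dual activation-coding paths" -- NOT the
# paper's implementation. Scale, rounding, and the non-negative gating that
# produces sparsity are assumptions for illustration only.
import torch

def int8_spike_coding(x: torch.Tensor, max_spikes: int = 127):
    """Quantize activations to non-negative integer 'spike counts'.

    Values at or below zero emit no spikes, so the coded tensor is sparse and
    can, in principle, drive event-based (accumulate-only) hardware.
    """
    scale = x.clamp_min(0).max() / max_spikes + 1e-8          # per-tensor scale
    spikes = torch.clamp((x / scale).round(), 0, max_spikes).to(torch.int8)
    sparsity = (spikes == 0).float().mean().item()
    return spikes, scale, sparsity

def fp8_coding(x: torch.Tensor):
    """Cast activations to FP8 (e4m3) when the runtime supports it."""
    if hasattr(torch, "float8_e4m3fn"):
        return x.to(torch.float8_e4m3fn)
    return x.to(torch.bfloat16)          # fallback for older PyTorch builds

if __name__ == "__main__":
    act = torch.randn(4, 1024)                       # stand-in activations
    spikes, scale, sparsity = int8_spike_coding(act)
    print(f"spike sparsity: {sparsity:.2%}")         # ~50% for N(0,1) inputs
    w = torch.randn(1024, 256)
    y = (spikes.float() @ w) * scale                 # accumulate, then rescale
```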

What carries the argument

Dual-Space Sparse Attention (DSSA) is the central mechanism: an inter-layer hybrid of Sparse Softmax Attention and Sparse Linear Attention that improves the performance-efficiency trade-off for long sequences while enabling the dual quantization paths.
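
A minimal sketch of such an inter-layer hybrid, assuming a MoBA-style top-k block gate for the sparse softmax layers and an elu+1 feature map for the linear layers; the block size, top-k, 1:3 placement, and the omission of causal masking are simplifications for illustration, not the SpB2.0 implementation.

```python
# Non-causal, single-head sketch of an inter-layer hybrid of block-sparse
# softmax attention and linear attention, loosely in the spirit of DSSA.
import torch
import torch.nn.functional as F

def block_sparse_softmax_attn(q, k, v, block_size=64, topk=4):
    """Each query attends only to its top-k key blocks (MoBA-style gating).
    Causal masking is omitted; T is assumed divisible by block_size."""
    T, d = k.shape
    nb = T // block_size
    centroids = k.reshape(nb, block_size, d).mean(dim=1)            # (nb, d)
    keep = (q @ centroids.T).topk(min(topk, nb), dim=-1).indices    # (Tq, topk)
    block_id = torch.arange(T, device=q.device) // block_size       # (T,)
    allowed = (block_id[None, None, :] == keep[:, :, None]).any(dim=1)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def linear_attn(q, k, v, eps=1e-6):
    """Non-causal linear attention with an elu+1 feature map (O(T*d^2))."""
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)
    kv = k.T @ v                                     # (d, d_v) global state
    z = q @ k.sum(0, keepdim=True).T + eps           # (Tq, 1) normalizer
    return (q @ kv) / z

def dssa_layer_types(n_layers, ratio=(1, 3)):
    """One sparse-softmax layer for every three sparse-linear layers."""
    period = sum(ratio)
    return ["sparse_softmax" if i % period == 0 else "sparse_linear"
            for i in range(n_layers)]

if __name__ == "__main__":
    T, d = 512, 64
    q, k, v = (torch.randn(T, d) for _ in range(3))
    print(dssa_layer_types(8))
    print(block_sparse_softmax_attn(q, k, v).shape, linear_attn(q, k, v).shape)
```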

If this is right

  • Context lengths exceeding 10 million tokens become feasible on eight A100 GPUs under vLLM where full-attention models run out of memory (a back-of-envelope memory sketch follows this list).
  • Time-to-first-token improves by a factor of 10.13 at 4 million context length.
  • Neuromorphic hardware at 500 MHz achieves a 70.6 percent area reduction and a 46.5 percent power reduction, enabled by the 64.31 percent spiking sparsity.
  • Both language-only and vision-language models can follow the same dual-path conversion and retain most base performance with low additional compute.
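
A back-of-envelope sketch of the first and third bullets: why a dense KV cache is the binding constraint at multi-million-token contexts, and what 64.31 percent sparsity means for event-driven work. The layer count, KV heads, head dimension, and KV precision are assumed values for a model of roughly this size, not figures taken from the paper.

```python
# Back-of-envelope arithmetic only; the configuration below is an assumption
# for a ~4-5B model. The point is the scaling, not the exact numbers.

def kv_cache_gib(tokens, layers=36, kv_heads=8, head_dim=128, bytes_per=2):
    """Full-attention KV-cache size in GiB (keys + values, one sequence)."""
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per / 2**30

def effective_ops(dense_ops, sparsity=0.6431):
    """Fraction of accumulate events that actually fire at this sparsity."""
    return dense_ops * (1.0 - sparsity)

if __name__ == "__main__":
    for n in (250_000, 4_000_000, 10_000_000):
        print(f"{n:>12,} tokens -> {kv_cache_gib(n):8.1f} GiB of KV cache")
    # 8x A100-80GB is ~640 GiB in total; under these assumptions a dense KV
    # cache at 10M tokens already exceeds that before weights are counted.
    print(f"ops kept at 64.31% sparsity: {effective_ops(1.0):.2%} of dense")
```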

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid attention and conversion approach could be tested on larger base models to see whether the efficiency gains scale without proportional increases in training cost.
  • The spiking quantization path suggests potential for lower-power deployment on edge devices that support event-driven computation.
  • Cross-platform results imply that existing transformer checkpoints can be adapted for mixed GPU and neuromorphic environments with modest effort.
  • If the sparsity and recovery hold, energy use for long-context serving in data centers could decrease substantially.

Load-bearing premise

The transformer-to-hybrid conversion pipeline with curated open-source data recovers nearly all base-model capability without hidden losses on tasks or data distributions outside the reported evaluations.

What would settle it

A clear drop below the claimed recovery level on standard long-context or multimodal benchmarks, or on tasks outside the training distribution, when compared directly to the unmodified Qwen3-4B base model, would show that the conversion does not preserve capability as stated.

Figures

Figures reproduced from arXiv: 2604.22575 by Anjie Hu, Bohan Sun, Bo Xu, Fangzhi Zhong, Guoqi Li, Han Xu, Jinghao Zhuang, Kun Yang, Lingtao Ouyang, Shaowei Gu, Shurong Wang, Siyu Ding, Xuerui Qiu, Yibo Zhong, Yuhong Chou, Yupeng Feng, Yuqi Pan, Zehao Liu, Zhiyong Qin.

Figure 1
Figure 1. Architecture of SpikingBrain2.0-5B (SpB2.0). SpB2.0 adopts a 1:3 inter-layer hybrid design, termed DSSA, that combines MoBA and SSE, together with dual-path activation-coding strategies for linear projections. This design allows SpB2.0 to address the dominant computational bottlenecks of standard Transformers across different sequence-length regimes and hardware platforms. …
Figure 2
Figure 2. Layerwise performance sensitivity and resulting layer assignment for SpB2.0-5B. Left: each point denotes the performance of a candidate model obtained by replacing a single FA layer with SSE. Dashed lines indicate the Qwen3 baseline performance on MMLU and LongBench. Layers whose replacement causes sharp degradation are selected as MoBA layers. Right: the resulting hybrid layer assignment, where the final …
Figure 3
Figure 3. Dual quantization paths in SpikingBrain2.0. The FP8 path targets practical inference acceleration by executing FP8 MatMul on NVIDIA Hopper Tensor Cores. The INT8-Spiking path converts activations into sparse spike sequences, enabling sparse event-driven accumulation on asynchronous neuromorphic hardware. Together, the two paths support efficient deployment across both mainstream GPU platforms and neuromorphic …
Figure 4
Figure 4. Training pipelines for Transformer-to-Hybrid (T2H) conversion in SpikingBrain2.0. SpB2.0 adopts dedicated conversion paths for LLMs and VLMs, enabling efficient architectural migration across both language-only and vision-language settings.
Figure 5
Figure 5. MMLU score versus distillation training data size. Dashed lines indicate the Qwen3 MMLU baselines, including the original score and the score after training on our 200k CT dataset. …
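
Figure 2 describes a sensitivity scan for choosing which layers keep sparse softmax attention. A minimal sketch of that selection loop, where the evaluation callback and the number of retained MoBA layers are hypothetical placeholders rather than the paper's exact protocol:

```python
# Minimal sketch of the layerwise sensitivity scan suggested by Figure 2.
# `evaluate` is a hypothetical callback (e.g. an MMLU/LongBench harness);
# the swap mechanics and scoring are assumptions, not the paper's protocol.
from typing import Callable, List

def assign_layers(n_layers: int, evaluate: Callable[[List[str]], float],
                  baseline: float, n_moba: int = 9) -> List[str]:
    """Swap one full-attention (FA) layer at a time for SSE, score the
    candidate, and keep sparse softmax attention (MoBA) on the layers whose
    replacement hurts the most; every other layer becomes SSE."""
    drops = []
    for i in range(n_layers):
        layout = ["SSE" if j == i else "FA" for j in range(n_layers)]
        drops.append((baseline - evaluate(layout), i))   # sensitivity of layer i
    sensitive = {i for _, i in sorted(drops, reverse=True)[:n_moba]}
    return ["MoBA" if i in sensitive else "SSE" for i in range(n_layers)]

if __name__ == "__main__":
    # Toy stand-in: pretend only layers 0, 12, and 24 are sensitive to the swap.
    toy_eval = lambda layout: 0.70 - 0.10 * sum(layout[i] == "SSE" for i in (0, 12, 24))
    print(assign_layers(36, toy_eval, baseline=0.70, n_moba=3))
```
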
read the original abstract

Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both architecture and training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with 70.6% and 46.5% area and power reduction at 500MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios.
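
The T2H pipeline in the abstract adapts an existing transformer with curated data rather than training from scratch. A common recipe for this kind of conversion is logit distillation from the frozen base model into the hybrid student, followed by long-context continual training; the sketch below shows one distillation step under that assumption, with the temperature, loss, and toy models chosen for illustration rather than taken from the paper.

```python
# One Transformer-to-Hybrid distillation step, sketched as teacher-student
# logit distillation (KL at temperature tau). This framing is an assumption
# about the recipe; the paper's exact losses, data, and schedule may differ.
import torch
import torch.nn.functional as F

def t2h_distill_step(teacher, student, optimizer, input_ids, tau=2.0):
    """`teacher` is the frozen base LM, `student` the hybrid DSSA model;
    both return logits of shape (batch, seq, vocab)."""
    with torch.no_grad():
        t_logits = teacher(input_ids)
    s_logits = student(input_ids)
    loss = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.log_softmax(t_logits / tau, dim=-1),
        log_target=True, reduction="batchmean",
    ) * tau * tau
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Toy stand-ins for the two models, just to show the call pattern.
    vocab, d = 1000, 64
    toy = lambda: torch.nn.Sequential(torch.nn.Embedding(vocab, d),
                                      torch.nn.Linear(d, vocab))
    teacher, student = toy(), toy()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    ids = torch.randint(0, vocab, (2, 16))
    print(t2h_distill_step(teacher, student, opt, ids))
```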

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SpikingBrain2.0 (SpB2.0), a 5B-parameter brain-inspired foundation model advancing its predecessor via Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE). It further proposes a Transformer-to-Hybrid (T2H) conversion pipeline with dual paths for LLMs and VLMs on curated open-source data, plus dual quantization (INT8-Spiking for event-driven computation and FP8 for GPU acceleration). Empirically, SpB2.0-5B and SpB2.0-VL-5B are claimed to recover most Qwen3-4B capability with under 7k A100 GPU hours, deliver 10.13x TTFT speedup at 4M context, support >10M tokens on 8 A100 GPUs under vLLM (where full attention exceeds limits), and achieve 64.31% sparsity with 70.6% area and 46.5% power reduction on neuromorphic hardware at 500MHz.

Significance. If the performance-recovery and efficiency claims hold under rigorous validation, the work would be significant for efficient long-context modeling in resource-constrained settings. The combination of hybrid sparse attention, low-overhead T2H conversion, and cross-platform support (GPU FP8 and neuromorphic spiking) addresses key bottlenecks in scaling context length while maintaining multimodal capability, offering a practical route toward lightweight foundation models for edge and specialized hardware.

major comments (2)
  1. [Abstract] The abstract reports concrete empirical outcomes (10.13x TTFT speedup at 4M context, capability recovery of Qwen3-4B, 64.31% sparsity) but supplies no evaluation details, full baselines, error bars, data-exclusion rules, or per-task breakdowns; the central performance-recovery and speedup claims therefore cannot be assessed from the provided information.
  2. [Method and Experiments] T2H pipeline and DSSA: the claim that the T2H conversion recovers most base-model capability is load-bearing for the headline efficiency results, yet the manuscript provides no ablations on DSSA components (MoBA + SSE) or explicit long-context metrics such as Needle-in-Haystack at 4M; the approximation inherent in the hybrid sparse attention risks degrading long-range dependencies on untested distributions, directly undermining the practical value of the reported speedups and neuromorphic gains.
minor comments (1)
  1. [Abstract] The names 'SpB2.0-5B' and 'SpB2.0-VL-5B' are introduced without an explicit mapping to the base model size or the architectural differences from Qwen3-4B.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to targeted revisions that strengthen the presentation of our results without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract] The abstract reports concrete empirical outcomes (10.13x TTFT speedup at 4M context, capability recovery of Qwen3-4B, 64.31% sparsity) but supplies no evaluation details, full baselines, error bars, data-exclusion rules, or per-task breakdowns; the central performance-recovery and speedup claims therefore cannot be assessed from the provided information.

    Authors: We agree that the abstract's brevity limits the inclusion of full evaluation protocols. The main manuscript details the baselines (primarily Qwen3-4B and other long-context efficient models), per-task results across standard benchmarks, and error bars from repeated runs in Section 4 and the associated tables. Data handling follows the original benchmark protocols, with exclusions noted in Appendix B. To address the concern, we will revise the abstract to include a brief reference to the evaluation framework and direct readers to the experimental sections for complete details, baselines, and breakdowns. revision: partial

  2. Referee: [Method and Experiments] T2H pipeline and DSSA: the claim that the T2H conversion recovers most base-model capability is load-bearing for the headline efficiency results, yet the manuscript provides no ablations on DSSA components (MoBA + SSE) or explicit long-context metrics such as Needle-in-Haystack at 4M; the approximation inherent in the hybrid sparse attention risks degrading long-range dependencies on untested distributions, directly undermining the practical value of the reported speedups and neuromorphic gains.

    Authors: We acknowledge that the manuscript does not present explicit component ablations for DSSA (MoBA + SSE) in the main text, nor a Needle-in-Haystack evaluation at the full 4M scale. Long-context results are instead reported via LongBench and custom retrieval/perplexity tasks up to 4M tokens. We will add a dedicated ablation subsection analyzing the individual and combined contributions of MoBA and SSE, along with Needle-in-Haystack results at contexts up to 1M (with extrapolation analysis to 4M, noting the prohibitive cost of full-attention baselines at extreme lengths). Regarding risks to long-range dependencies, the inter-layer hybrid design interleaves sparse softmax attention for local precision with linear attention for global efficiency; our empirical capability recovery on long-context benchmarks indicates that this preserves essential dependencies without significant degradation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results with no equations or self-referential reductions

full rationale

The paper's abstract and described contributions contain no mathematical derivations, equations, or first-principles predictions. DSSA and the T2H pipeline are introduced as architectural and training innovations, with all performance numbers (10.13x TTFT, 64.31% sparsity, recovery of Qwen3-4B capability) presented as direct empirical measurements on hardware and benchmarks. No fitted parameters are renamed as predictions, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The chain is self-contained because claims rest on external comparisons and measurements rather than internal definitions or tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on the empirical success of the proposed DSSA architecture and T2H pipeline; insufficient detail exists to enumerate exact free parameters or additional axioms beyond the domain assumption that curated open data suffices for capability recovery.

axioms (1)
  • domain assumption: Curated open-source data and the dual conversion paths in T2H recover most base Transformer capability with under 7k GPU hours.
    This assumption underpins the claim of minimal training overhead and performance recovery.
invented entities (1)
  • Dual-Space Sparse Attention (DSSA): no independent evidence
    purpose: Inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE) for improved long-context trade-off
    New architectural component introduced to address computation bottlenecks.

pith-pipeline@v0.9.0 · 5744 in / 1479 out tokens · 73434 ms · 2026-05-08T12:14:40.867626+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 37 canonical work pages · 20 internal anchors
