An Empirical Study of Mamba-based Language Models
Pith reviewed 2026-05-18 10:25 UTC · model grok-4.3
The pith
The 8B Mamba-2-Hybrid outperforms a standard 8B Transformer on all twelve standard tasks evaluated and is projected to generate tokens up to 8x faster at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a controlled comparison, the 8B Mamba-2-Hybrid architecture consisting of 43 percent Mamba-2, 7 percent attention, and 50 percent MLP layers exceeds the 8B Transformer by 2.65 points on average across twelve standard tasks and is projected to generate tokens up to eight times faster at inference time; the same hybrid remains competitive with the Transformer on twenty-three additional long-context tasks when both are extended to 16K, 32K, and 128K sequence lengths.
What carries the argument
The Mamba-2-Hybrid architecture, which interleaves selective state-space layers with a small number of attention layers to improve copying and in-context learning while retaining linear-time inference.
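To make the layer mix concrete, the sketch below converts the quoted fractions (43 percent Mamba-2, 7 percent attention, 50 percent MLP) into whole layer counts. The 56-layer depth is an assumption chosen for illustration and is not stated on this page; `layer_counts` is a hypothetical helper that uses plain largest-remainder rounding, not the paper's own allocation algorithm.

```python
def layer_counts(total_layers: int, fractions: dict[str, float]) -> dict[str, int]:
    """Turn per-type fractions into whole layer counts that sum to total_layers."""
    raw = {name: frac * total_layers for name, frac in fractions.items()}
    counts = {name: int(value) for name, value in raw.items()}  # round down first
    # Hand any leftover layers to the types with the largest fractional remainder.
    leftover = total_layers - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

# Fractions quoted in the summary; the depth of 56 is an assumed value.
MIX = {"mamba2": 0.43, "attention": 0.07, "mlp": 0.50}
counts = layer_counts(56, MIX)
print(counts)
```

Under that assumed depth, only a handful of layers are attention layers, which is the sense in which attention is a small minority of the stack.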
If this is right
- Hybrid designs that combine a majority of Mamba-2 layers with a minority of attention layers can exceed pure Transformers on standard short-context benchmarks and match or exceed them on average on long-context benchmarks.
- The hybrid layer mix is projected to deliver up to 8x faster token generation without sacrificing task accuracy.
- Pure Mamba models remain limited on tasks that demand explicit copying or few-shot in-context learning even at 8B scale.
- The released checkpoints allow direct reproduction and extension of the scaling behavior observed up to 3.5T tokens.
- Long-context extensions of the hybrid maintain parity with Transformers when both architectures receive the same context-length adaptations.
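One reason fewer attention layers can help at inference time is that the key-value cache grows with sequence length only in attention layers. The sketch below estimates that cache size for an all-attention stack and for a stack with few attention layers. All shapes (32 versus 4 attention layers, 8 KV heads, head dimension 128, 2 bytes per value) are hypothetical values chosen for illustration, and the fixed-size recurrent state of the Mamba-2 layers is ignored. The resulting memory ratio follows from those assumed layer counts and should not be read as the paper's projected speedup.

```python
def kv_cache_bytes(attn_layers: int, seq_len: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2, batch: int = 1) -> int:
    """Bytes needed to cache keys and values for every attention layer."""
    # Factor 2 covers one key tensor and one value tensor per attention layer.
    return 2 * attn_layers * batch * seq_len * kv_heads * head_dim * bytes_per_value

# Hypothetical shapes, chosen only for illustration (not taken from the paper).
TRANSFORMER_ATTN_LAYERS = 32
HYBRID_ATTN_LAYERS = 4
KV_HEADS, HEAD_DIM = 8, 128

for seq_len in (16_384, 32_768, 131_072):
    full = kv_cache_bytes(TRANSFORMER_ATTN_LAYERS, seq_len, KV_HEADS, HEAD_DIM)
    hybrid = kv_cache_bytes(HYBRID_ATTN_LAYERS, seq_len, KV_HEADS, HEAD_DIM)
    print(f"{seq_len:>7} tokens: transformer {full / 2**30:.2f} GiB, "
          f"hybrid {hybrid / 2**30:.2f} GiB")
```

The cache is linear in both sequence length and the number of attention layers, so cutting attention layers shrinks it proportionally at every context length.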
Where Pith is reading between the lines
- At still larger scales the inference-speed advantage of the hybrid could become decisive for production deployment where latency and memory costs dominate.
- The optimal fraction of attention layers may vary by domain and could be tuned automatically rather than fixed at 7 percent.
- The results suggest that selective state-space models benefit from targeted attention injection specifically for in-context reasoning rather than uniform replacement of all layers.
Load-bearing premise
The 8B models were trained under sufficiently identical data, optimizer, learning-rate schedule, and regularization conditions so that performance gaps can be attributed primarily to architecture.
What would settle it
Re-train an 8B Transformer using the exact same data mixture, optimizer, and schedule as the hybrid and measure whether the 2.65-point average gap disappears.
Original abstract
Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a direct empirical comparison of 8B-parameter Mamba, Mamba-2, Transformer, and Mamba-2-Hybrid models trained on the same datasets of up to 3.5T tokens. Pure SSMs match or exceed Transformers on many tasks but lag on copying/in-context learning and long-context reasoning; the Mamba-2-Hybrid (43% Mamba-2, 7% attention, 50% MLP) exceeds the Transformer on all 12 standard tasks by +2.65 points on average, maintains parity on 23 additional long-context tasks up to 128K, and is predicted to offer up to 8x faster inference. Checkpoints and training code are released.
Significance. If training conditions are equivalent, the work supplies concrete evidence that hybrid SSM-attention models can outperform pure Transformers at the 8B scale while delivering inference efficiency gains. The public release of checkpoints and Megatron-LM code is a clear strength for reproducibility and follow-on research.
major comments (1)
- The central claim that the +2.65 average gain is attributable to the hybrid layer mix requires that data, optimizer, learning-rate schedule, and regularization were identical across the 8B Transformer and Mamba-2-Hybrid. The manuscript states only that models were 'trained on the same datasets of up to 3.5T tokens' and supplies no table or section listing per-model values for peak LR, decay, Adam betas, weight decay, or clipping. Even modest differences in these settings could produce score shifts comparable to the reported margin.
minor comments (2)
- Reporting standard deviation across random seeds for the 8B models would strengthen defensibility of the architectural conclusion, especially for the headline +2.65 average.
- The long-context section should explicitly state whether the 16K/32K/128K variants were trained from scratch or obtained via continued pre-training / fine-tuning of the base models.
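The seed-variance comment can be illustrated with a small check that compares the average gap between two models against run-to-run spread. This is a minimal sketch: `gap_summary` is a hypothetical helper, and the per-seed scores are placeholders invented for the example, not results from the paper.

```python
from statistics import mean, stdev

def gap_summary(baseline_scores, candidate_scores):
    """Return (mean gap, largest per-model seed standard deviation)."""
    gap = mean(candidate_scores) - mean(baseline_scores)
    noise = max(stdev(baseline_scores), stdev(candidate_scores))
    return gap, noise

# Placeholder averages for three seeds per model; these are NOT results from the paper.
transformer_runs = [50.0, 51.0, 52.0]
hybrid_runs = [53.0, 54.0, 55.0]

gap, noise = gap_summary(transformer_runs, hybrid_runs)
print(f"gap = {gap:.2f} points, seed std = {noise:.2f} points")
```

A gap that is only about as large as the seed standard deviation would be weak evidence for an architectural effect; a gap several times larger would be more persuasive.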
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address the concern regarding training configuration details below.
Point-by-point responses
Referee: The central claim that the +2.65 average gain is attributable to the hybrid layer mix requires that data, optimizer, learning-rate schedule, and regularization were identical across the 8B Transformer and Mamba-2-Hybrid. The manuscript states only that models were 'trained on the same datasets of up to 3.5T tokens' and supplies no table or section listing per-model values for peak LR, decay, Adam betas, weight decay, or clipping. Even modest differences in these settings could produce score shifts comparable to the reported margin.
Authors: We agree that explicit documentation of the full training configuration is necessary to substantiate that performance differences arise from the layer mix rather than hyperparameter variations. All models were trained under identical conditions in the Megatron-LM framework, using the same datasets, optimizer settings, peak learning rate, decay schedule, Adam betas, weight decay, and gradient clipping. To improve transparency, we will add a new table in the revised manuscript that lists these per-model hyperparameter values.
Revision: yes
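The kind of per-model table discussed in this exchange could be checked mechanically. The sketch below compares two training configurations and reports any setting that differs. The key names and values are hypothetical placeholders, not the paper's actual hyperparameters, and `config_diff` is an illustrative helper rather than part of Megatron-LM.

```python
def config_diff(a: dict, b: dict) -> dict:
    """Map each differing key to its (a, b) values; missing keys appear as None."""
    keys = sorted(set(a) | set(b))
    return {key: (a.get(key), b.get(key)) for key in keys if a.get(key) != b.get(key)}

# Placeholder values; the key names and numbers are hypothetical, not the paper's settings.
transformer_cfg = {"peak_lr": 3e-4, "weight_decay": 0.1, "grad_clip": 1.0}
hybrid_cfg = {"peak_lr": 3e-4, "weight_decay": 0.1, "grad_clip": 1.0}

print(config_diff(transformer_cfg, hybrid_cfg))  # empty dict when the settings agree
```

An empty result would support the claim of identical training conditions; any non-empty entry identifies a setting that would need to be ruled out as the source of the score gap.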
Circularity Check
No circularity: purely empirical comparisons with no derivations or fitted predictions
Full rationale
The manuscript is an empirical study that trains 8B-parameter Mamba, Mamba-2, Transformer, and Mamba-2-Hybrid models on identical datasets of up to 3.5T tokens and reports direct performance measurements on 12 standard tasks plus 23 long-context tasks. No equations, fitted parameters, or mathematical derivations appear; the central claim that the hybrid exceeds the Transformer by +2.65 points on average is presented as an observed outcome of independently trained models rather than a prediction derived from any model equation or self-citation. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises. The analysis is therefore self-contained against external benchmarks of model performance.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Models trained on identical data and comparable optimization settings allow direct attribution of performance differences to architecture.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LedgerCanonicalityZeroParameterComparisonLedger (tag: echoes)
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: models were 'trained on the same datasets of up to 3.5T tokens' but supplies no table or section listing per-model values for peak LR, decay, Adam betas, weight decay, or clipping
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Hidden State Poisoning Attacks against Mamba-based Language Models
  Short input phrases can irreversibly overwrite hidden states in Mamba models, impairing information retrieval on a new benchmark while leaving pure Transformer models unaffected.
- Selective Rotary Position Embedding
  Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when ...
- Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
  Q-RAG trains embedders via RL for multi-step retrieval and reports state-of-the-art results on BabiLong and RULER benchmarks for contexts up to 10M tokens.
- Priming: Hybrid State Space Models From Pre-trained Transformers
  Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
- Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
  Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
- ZAYA1-8B Technical Report
  ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
  Rhamba uses region-aware masking strategies and hybrid Attention-Mamba models pretrained on ABIDE fMRI data to achieve top AUROC on schizophrenia and ADHD classification tasks while outperforming prior methods.
- CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
  CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
- Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
  Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
- Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks
  State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.
- Kimi Linear: An Expressive, Efficient Attention Architecture
  Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
  Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
- Kaczmarz Linear Attention
  Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
- Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
  Rhamba is a region-aware hybrid Attention-Mamba framework that uses anatomically guided masking for self-supervised pretraining on ABIDE fMRI data and shows competitive AUROC on downstream schizophrenia and ADHD class...
- Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
  Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
- NVIDIA Nemotron 3: Efficient and Open Intelligence
  NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
- TTT3R: 3D Reconstruction as Test-Time Training
  TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
- StateX: Enhancing RNN Recall via Post-training State Expansion
  StateX post-trains RNNs to expand recurrent state size, improving recall and in-context learning with negligible parameter growth.
- Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
  This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...