pith. machine review for the scientific record.

arxiv: 2403.19887 · v2 · submitted 2024-03-28 · 💻 cs.CL · cs.LG

Jamba: A Hybrid Transformer-Mamba Language Model

Amir Bergman, Avashalom Manevich, Barak Lenz, Erez Safahi, Erez Shwartz, Gal Cohen, Hofit Bata, Itay Dalmedigos, Jhonathan Osin, Michael Gokhman, Mor Zusman, Nir Ratner, Noam Rozen, Omri Abend, Opher Lieber, Raz Alon, Roman Glozman, Shai Shalev-Shwartz, Shaked Meirom, Tomer Asida, Yoav Shoham, Yonatan Belinkov

Pith reviewed 2026-05-13 14:05 UTC · model grok-4.3

keywords hybrid architecture · Transformer-Mamba · mixture-of-experts · long-context modeling · efficient inference · state-space models · large language models

The pith

Jamba interleaves Transformer and Mamba layers with selective MoE to deliver Transformer-level accuracy at lower memory and higher throughput for contexts up to 256K tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Jamba as a hybrid large language model that combines blocks of Transformer layers with Mamba layers and adds mixture-of-experts in selected layers. This design seeks to capture the modeling strengths of attention-based Transformers while gaining the linear-time, long-sequence efficiency of state-space Mamba models. The resulting model fits inside a single 80GB GPU, runs with higher throughput and smaller memory footprint than comparable pure Transformers, and reaches state-of-the-art scores on standard benchmarks plus strong results at 256K-token context lengths. A sympathetic reader cares because the work shows a concrete route to scaling language models without paying the full quadratic cost of attention for every token. The authors back the claim with large-scale training runs and systematic ablations that highlight which interleaving and expert-mixing choices matter at this scale.

Core claim

Jamba is constructed by interleaving Transformer and Mamba layers, with MoE modules inserted in some of the layers to expand capacity while keeping the number of active parameters manageable. In the released configuration the hybrid fits in one 80GB GPU, exhibits higher throughput and lower memory use than vanilla Transformers, and achieves state-of-the-art performance on standard language-model benchmarks together with strong results on long-context evaluations up to 256K tokens.
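The distinction between total and active parameters in an MoE stack can be made concrete with a little arithmetic. The following is a minimal sketch, not the paper's accounting: the function name `moe_param_counts`, the choice of which layers carry MoE, and every number in the usage lines are illustrative assumptions, and router and embedding parameters are ignored.

```python
def moe_param_counts(n_layers, moe_every, n_experts, top_k,
                     mlp_params, other_params_per_layer):
    """Return (total, active) parameter counts for a simplified stack.

    Hypothetical model: every `moe_every`-th layer replaces its single MLP
    with `n_experts` expert MLPs, of which `top_k` run for each token.
    Router and embedding parameters are ignored to keep the sketch short.
    """
    if not 1 <= top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    if n_layers <= 0 or moe_every <= 0:
        raise ValueError("n_layers and moe_every must be positive")

    n_moe = sum(1 for i in range(n_layers) if i % moe_every == moe_every - 1)
    n_dense = n_layers - n_moe

    shared = n_layers * other_params_per_layer + n_dense * mlp_params
    total = shared + n_moe * n_experts * mlp_params   # all experts are stored
    active = shared + n_moe * top_k * mlp_params      # only top_k run per token
    return total, active


# Illustrative values in arbitrary units; none are taken from the paper.
total, active = moe_param_counts(
    n_layers=8, moe_every=2, n_experts=16, top_k=2,
    mlp_params=100, other_params_per_layer=50,
)
print(f"total={total} active={active} ratio={total / active:.1f}")
```

Under these assumed numbers the stack stores several times more parameters than it uses for any single token, which is the sense in which MoE expands capacity while keeping active parameters manageable.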

What carries the argument

The interleaving of Transformer and Mamba layers together with selective placement of mixture-of-experts modules in some layers.
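One way to picture the interleaving is as a per-layer layout generated from two periods, one for attention placement and one for MoE placement. The sketch below is an assumption-laden illustration rather than the released configuration: `LayerSpec`, `build_layout`, the period and offset parameters, and the example values are all hypothetical names and numbers chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    mixer: str  # "attention" or "mamba"
    mlp: str    # "moe" or "dense"


def build_layout(n_layers, attn_every, moe_every, attn_offset=0, moe_offset=1):
    """Build a hypothetical interleaved layout.

    Layer i uses attention when i % attn_every == attn_offset (Mamba otherwise)
    and an MoE MLP when i % moe_every == moe_offset (dense MLP otherwise).
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    if not 0 <= attn_offset < attn_every or not 0 <= moe_offset < moe_every:
        raise ValueError("each offset must satisfy 0 <= offset < period")

    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == attn_offset else "mamba"
        mlp = "moe" if i % moe_every == moe_offset else "dense"
        layout.append(LayerSpec(mixer, mlp))
    return layout


def as_string(layout):
    """Render the mixer pattern with 'T' for attention and 'M' for Mamba."""
    return "".join("T" if spec.mixer == "attention" else "M" for spec in layout)


# Illustrative values only: one attention layer in every four, MoE in every second layer.
layout = build_layout(n_layers=8, attn_every=4, moe_every=2)
print(as_string(layout))
print(sum(spec.mlp == "moe" for spec in layout), "MoE layers")
```

Changing `attn_every` and `moe_every` is the kind of re-configuration the paper describes as resource- and objective-specific; the sketch only enumerates layouts and says nothing about which ones train well.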

If this is right

  • The hybrid reaches state-of-the-art benchmark scores while using less memory and running faster than pure Transformer baselines.
  • Strong performance holds for context lengths up to 256K tokens.
  • Specific interleaving ratios and MoE placement decisions prove critical for stable large-scale training.
  • The architecture supports flexible re-configuration for different resource or objective constraints.
  • Ablation runs reveal several previously unobserved properties of combined Transformer-Mamba stacks.
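The memory claim can be illustrated with back-of-the-envelope arithmetic for the key-value cache, which only attention layers need. This is a rough sketch under stated assumptions: the function `kv_cache_bytes` and all layer counts, head counts, and dimensions are hypothetical, 16-bit values are assumed, and the fixed-size recurrent state of the Mamba layers and the model weights are left out.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """Size of the attention key-value cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    Only attention layers contribute; Mamba layers keep a state whose
    size does not grow with seq_len and is ignored in this sketch.
    """
    return (2 * n_attn_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch)


GIB = 1024 ** 3
seq_len = 256 * 1024  # 256K tokens

# Hypothetical 32-layer stacks: all-attention versus one attention layer in eight.
full = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
hybrid = kv_cache_bytes(n_attn_layers=4, n_kv_heads=8, head_dim=128, seq_len=seq_len)

print(f"all-attention: {full / GIB:.1f} GiB")  # 32.0 GiB
print(f"hybrid:        {hybrid / GIB:.1f} GiB")  # 4.0 GiB
```

With these assumed shapes the cache shrinks in direct proportion to the number of attention layers, which is why replacing most attention layers with Mamba layers matters most at long context lengths.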

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interleaving pattern could be tested on non-language sequence tasks such as time-series or protein modeling where long-range dependencies matter.
  • Varying the ratio of Mamba to Transformer layers might produce task-specific efficiency-accuracy trade-offs that pure architectures cannot reach.
  • The planned release of ablation checkpoints would allow systematic study of how Mamba layers affect attention patterns in the hybrid stack.

Load-bearing premise

That the chosen pattern of interleaving Transformer and Mamba layers plus selective MoE placement will continue to deliver the reported throughput, memory, and accuracy benefits when the model is scaled further or applied to new objectives.

What would settle it

Training and evaluating a larger-scale Jamba variant on the same benchmarks and seeing whether accuracy stays within a few percent of a matched pure Transformer while throughput and memory advantages remain intact.

Original abstract

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Jamba, a hybrid Transformer-Mamba MoE language model that interleaves blocks of Transformer and Mamba layers, with MoE applied selectively in some layers. The central claim is that this architecture delivers high throughput and low memory footprint relative to vanilla Transformers while achieving state-of-the-art results on standard LM benchmarks and strong performance on long-context tasks up to 256K tokens, all while fitting in a single 80GB GPU. The authors report ablation studies on layer ordering and expert mixing, describe emergent properties observed during training, and commit to releasing model weights and ablation checkpoints.

Significance. If the empirical results hold under reproduction, the work is significant for demonstrating a practical, scalable hybrid architecture that combines the strengths of attention-based and state-space models within an MoE framework. This could influence future LLM designs aimed at efficient long-context modeling and high-throughput inference, particularly by showing that selective interleaving and expert placement can mitigate the quadratic costs of Transformers without sacrificing accuracy. The public release of weights and ablations further strengthens its value to the community.

major comments (2)
  1. [§4] §4 (Experiments and Results): The SOTA and long-context claims rest on benchmark numbers that are referenced but not reproduced in the visible summary; the manuscript must include explicit tables (e.g., Table X comparing perplexity or accuracy against published baselines such as Llama-2 or Mistral) with error bars or multiple seeds to substantiate the central performance assertion, as single-run large-scale results can be sensitive to training variance.
  2. [§3.2] §3.2 (Architecture details): The specific interleaving ratio (Transformer vs. Mamba blocks) and selective MoE placement are presented as empirically chosen, yet the paper should quantify the sensitivity of throughput/memory/accuracy trade-offs to small changes in this ratio; without such analysis, it is unclear whether the reported gains are robust or tied to a narrow hyperparameter sweet spot.
minor comments (3)
  1. [Abstract] The abstract states 'state-of-the-art performance' without citing the exact benchmarks or scores; add a parenthetical reference to the main results table.
  2. [§3] Notation for layer types (e.g., 'T' for Transformer, 'M' for Mamba) should be defined at first use in §3 and used consistently in figures.
  3. [§4.3] The long-context section would benefit from a brief description of the evaluation protocol (e.g., needle-in-haystack or perplexity on extended sequences) to allow readers to assess the 256K claim.
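For readers unfamiliar with the needle-in-a-haystack style of protocol mentioned in the last comment, the following sketch shows the general shape of such a test. It is not the paper's evaluation protocol: `build_needle_prompt`, `contains_answer`, the prompt template, and the filler text are hypothetical, and the scoring rule is a simple substring check.

```python
def build_needle_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end)
    of the filler text and append a question about it."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    pos = round(depth * len(filler_sentences))
    sentences = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(sentences) + "\n\nQuestion: " + question + "\nAnswer:"


def contains_answer(model_output, expected):
    """Score a response by case-insensitive substring match."""
    return expected.lower() in model_output.lower()


# Hypothetical haystack; a real test would scale the filler to the target context length.
filler = ["The sky was grey."] * 10
needle = "The secret code is 7421."
prompt = build_needle_prompt(filler, needle, depth=0.5,
                             question="What is the secret code?")

# `model_output` stands in for a call to whatever model is under test.
model_output = "The secret code is 7421."
print(contains_answer(model_output, "7421"))
```

Sweeping `depth` and the amount of filler gives a grid of retrieval results per context length and needle position, which is the kind of detail the comment asks the authors to document.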

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation for minor revision. We appreciate the feedback on strengthening the empirical presentation and address each major comment below.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments and Results): The SOTA and long-context claims rest on benchmark numbers that are referenced but not reproduced in the visible summary; the manuscript must include explicit tables (e.g., Table X comparing perplexity or accuracy against published baselines such as Llama-2 or Mistral) with error bars or multiple seeds to substantiate the central performance assertion, as single-run large-scale results can be sensitive to training variance.

    Authors: We agree that explicit tables are necessary to clearly substantiate the performance claims. In the revised manuscript we will add a dedicated results table (Table 1) that directly reports perplexity and accuracy on standard LM benchmarks against published baselines including Llama-2 and Mistral, as well as a separate table for long-context evaluations up to 256K tokens. Regarding error bars and multiple seeds, our primary training runs are single executions given the prohibitive cost of large-scale LLM training; however, we will report variance from the ablation checkpoints (which include multiple configurations) and explicitly note the single-run nature of the main results. We will also highlight the planned public release of weights and ablation checkpoints to enable community reproduction. revision: partial

  2. Referee: [§3.2] §3.2 (Architecture details): The specific interleaving ratio (Transformer vs. Mamba blocks) and selective MoE placement are presented as empirically chosen, yet the paper should quantify the sensitivity of throughput/memory/accuracy trade-offs to small changes in this ratio; without such analysis, it is unclear whether the reported gains are robust or tied to a narrow hyperparameter sweet spot.

    Authors: We thank the referee for this suggestion. The current manuscript already presents ablation results on layer ordering and expert mixing. To directly quantify sensitivity, we will add a new paragraph and accompanying figure in the revised Section 3.2 (and/or appendix) that reports throughput, memory footprint, and accuracy for small perturbations around the chosen interleaving ratio (e.g., 1:1, 2:1, and 3:1 Transformer-to-Mamba block ratios) and nearby MoE placement patterns. This analysis will demonstrate that performance remains stable within a reasonable range and that the selected configuration is not an isolated sweet spot. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical architecture study

Full rationale

The paper reports an empirical construction and large-scale training of a hybrid Transformer-Mamba-MoE model, with architectural choices (layer interleaving, selective MoE) validated via ablations and benchmark numbers rather than any derivation chain. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters or self-citations; performance results are presented as measured outcomes against external baselines, making the central claims self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard Transformer and Mamba layer definitions plus conventional MoE routing; no new free parameters, axioms, or invented entities are introduced beyond the choice of interleaving pattern.

pith-pipeline@v0.9.0 · 5617 in / 1091 out tokens · 50850 ms · 2026-05-13T14:05:48.374116+00:00 · methodology

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences

    cs.AI 2026-05 unverdicted novelty 7.0

    FRACTAL integrates fractional recurrent architecture into SSMs using a tunable singularity index to capture multi-scale temporal features, reporting 87.11% average on Long Range Arena and outperforming S5.

  3. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  4. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.

  5. Component-Aware Self-Speculative Decoding in Hybrid Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.

  6. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  7. Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

    cs.CL 2026-05 unverdicted novelty 6.0

    Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.

  8. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  9. Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

    cs.LG 2026-05 unverdicted novelty 6.0

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  10. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

  11. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  12. Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI

    cs.LG 2026-05 unverdicted novelty 6.0

    Rhamba uses region-aware masking strategies and hybrid Attention-Mamba models pretrained on ABIDE fMRI data to achieve top AUROC on schizophrenia and ADHD classification tasks while outperforming prior methods.

  13. Lost in State Space: Probing Frozen Mamba Representations

    cs.CL 2026-04 unverdicted novelty 6.0

    Frozen Mamba patch-boundary readouts do not outperform mean pooling for sentence representations on SST-2, CoLA, MRPC, STS-B, and IMDb due to anisotropy (cosine similarity ~0.9999) and representational collapse (MCC=0...

  14. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  15. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  16. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  17. Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

    cs.AR 2026-04 unverdicted novelty 6.0

    Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.

  18. Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks

    cs.CR 2026-04 unverdicted novelty 6.0

    State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.

  19. MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

  20. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  21. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  22. Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI

    cs.LG 2026-05 unverdicted novelty 5.0

    Rhamba is a region-aware hybrid Attention-Mamba framework that uses anatomically guided masking for self-supervised pretraining on ABIDE fMRI data and shows competitive AUROC on downstream schizophrenia and ADHD class...

  23. Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

    cs.LG 2026-04 unverdicted novelty 5.0

    Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.

  24. LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

    cs.CL 2026-04 unverdicted novelty 5.0

    LayerTracer defines task particles as the first layer where target token probability rises sharply and vulnerable layers via maximum JS divergence after masking, showing task particles in deep layers and greater robus...

  25. FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

    cs.LG 2026-04 unverdicted novelty 5.0

    FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 23 Pith papers · 11 internal anchors

  1. [1]

    The hidden attention of mamba models

    Ameen Ali, Itamar Zimerman, and Lior Wolf. The hidden attention of mamba models. arXiv preprint arXiv:2403.01590, 2024

  2. [2]

    L-Eval: Instituting standardized evaluation for long context language models

    Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023

  3. [3]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020

  4. [4]

    Efficient intent detection with dual sentence encoders

    Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45, 2020

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  6. [6]

    QuAC: Question answering in context

    Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, 2018

  7. [7]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023

  8. [8]

    Unified scaling laws for routed language models

    Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pages 4057–4086. PMLR, 2022

  9. [9]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  12. [12]

    Open LLM leaderboard

    Hugging Face. Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2024

  13. [13]

    Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, and Mark J. F. Gales. Multi-head state space model for speech recognition. In Proceedings of INTERSPEECH 2023, pages 241–245, 2023

  14. [14]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

  15. [15]

    Hungry hungry hippos: Towards language modeling with state space models

    Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2022

  16. [16]

    A new algorithm for data compression

    Philip Gage. A new algorithm for data compression. The C Users Journal, 12(2):23–38, 1994

  17. [17]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  18. [18]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021

  19. [19]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

    Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021

  20. [20]

    Transformer language models without positional encodings still learn positional information

    Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1382–1390, 2022

  21. [21]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020

  22. [22]

    CUAD: An expert-annotated NLP dataset for legal contract review

    Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. CUAD: An expert-annotated NLP dataset for legal contract review. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

  23. [23]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  24. [24]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  25. [25]

    Needle in a haystack - pressure testing llms

    Greg Kamradt. Needle in a haystack - pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack/, 2023

  26. [26]

    The NarrativeQA reading comprehension challenge

    Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018

  27. [27]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019

  28. [28]

    An evaluation dataset for intent classification and out-of-scope prediction

    Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Intern...

  29. [29]

    Learning question classifiers

    Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002

  30. [30]

    TruthfulQA: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics

  31. [31]

    Benchmarking natural language understanding services for building conversational agents

    Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. Benchmarking natural language understanding services for building conversational agents. In Increasing Naturalness and Flexibility in Spoken Dialogue Interaction: 10th International Workshop on Spoken Dialogue Systems, pages 165–183. Springer, 2021

  32. [32]

    Learning word vectors for sentiment analysis

    Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011

  33. [33]

    Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

    Sabrina J Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y Lee, Benoît Sagot, et al. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. arXiv preprint arXiv:2112.10508, 2021

  34. [34]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022

  35. [35]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  36. [36]

    Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

    Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can Mamba learn how to learn? A comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024

  37. [37]

    Block-state transformers

    Jonathan Pilault, Mahan Fathi, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, and Ross Goroshin. Block-state transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  38. [38]

    MoE-Mamba: Efficient selective state space models with mixture of experts

    Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, and Sebastian Jaszczur. MoE-Mamba: Efficient selective state space models with mixture of experts. arXiv preprint arXiv:2401.04081, 2024

  39. [39]

    Hyena hierarchy: Towards larger convolutional language models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023

  40. [40]

    StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models

    Michael Poli, Jue Wang, Stefano Massaroli, Jeffrey Quesnelle, Ryan Carlow, Eric Nguyen, and Armin Thomas. StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models. https://github.com/togethercomputer/stripedhyena, 2023

  41. [41]

    Parallel context windows for large language models

    Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. Parallel context windows for large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6383–6402, 2023

  42. [42]

    WinoGrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020

  43. [43]

    Diagonal state space augmented transformers for speech recognition

    George Saon, Ankit Gupta, and Xiaodong Cui. Diagonal state space augmented transformers for speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  44. [44]

    Neural machine translation of rare words with subword units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016

  45. [45]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  46. [46]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017

  47. [47]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  48. [48]

    Challenging BIG-Bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023

  49. [49]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  50. [50]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  51. [51]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  52. [52]

    HellaSwag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

  53. [53]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  54. [54]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022

  55. [55]

    Efficient long sequence modeling via state space augmented transformer

    Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, and Jianfeng Gao. Efficient long sequence modeling via state space augmented transformer. arXiv preprint arXiv:2212.08136, 2022