Recognition: 2 theorem links
· Lean TheoremJamba: A Hybrid Transformer-Mamba Language Model
Pith reviewed 2026-05-13 14:05 UTC · model grok-4.3
The pith
Jamba interleaves Transformer and Mamba layers with selective MoE to deliver Transformer-level accuracy at lower memory and higher throughput for contexts up to 256K tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Jamba is constructed by interleaving Transformer and Mamba layers, with MoE modules inserted in some of the layers to expand capacity while keeping the number of active parameters manageable. In the released configuration the hybrid fits in one 80GB GPU, exhibits higher throughput and lower memory use than vanilla Transformers, and achieves state-of-the-art performance on standard language-model benchmarks together with strong results on long-context evaluations up to 256K tokens.
What carries the argument
The interleaving of Transformer and Mamba layers together with selective placement of mixture-of-experts modules in some layers.
If this is right
- The hybrid reaches state-of-the-art benchmark scores while using less memory and running faster than pure Transformer baselines.
- Strong performance holds for context lengths up to 256K tokens.
- Specific interleaving ratios and MoE placement decisions prove critical for stable large-scale training.
- The architecture supports flexible re-configuration for different resource or objective constraints.
- Ablation runs reveal several previously unobserved properties of combined Transformer-Mamba stacks.
Where Pith is reading between the lines
- The same interleaving pattern could be tested on non-language sequence tasks such as time-series or protein modeling where long-range dependencies matter.
- Varying the ratio of Mamba to Transformer layers might produce task-specific efficiency-accuracy trade-offs that pure architectures cannot reach.
- Public release of ablation checkpoints allows systematic study of how Mamba layers affect attention patterns in the hybrid stack.
Load-bearing premise
That the chosen pattern of interleaving Transformer and Mamba layers plus selective MoE placement will continue to deliver the reported throughput, memory, and accuracy benefits when the model is scaled further or applied to new objectives.
What would settle it
Training and evaluating a larger-scale Jamba variant on the same benchmarks and seeing whether accuracy stays within a few percent of a matched pure Transformer while throughput and memory advantages remain intact.
read the original abstract
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Jamba, a hybrid Transformer-Mamba MoE language model that interleaves blocks of Transformer and Mamba layers, with MoE applied selectively in some layers. The central claim is that this architecture delivers high throughput and low memory footprint relative to vanilla Transformers while achieving state-of-the-art results on standard LM benchmarks and strong performance on long-context tasks up to 256K tokens, all while fitting in a single 80GB GPU. The authors report ablation studies on layer ordering and expert mixing, describe emergent properties observed during training, and commit to releasing model weights and ablation checkpoints.
Significance. If the empirical results hold under reproduction, the work is significant for demonstrating a practical, scalable hybrid architecture that combines the strengths of attention-based and state-space models within an MoE framework. This could influence future LLM designs aimed at efficient long-context modeling and high-throughput inference, particularly by showing that selective interleaving and expert placement can mitigate the quadratic costs of Transformers without sacrificing accuracy. The public release of weights and ablations further strengthens its value to the community.
major comments (2)
- [§4] §4 (Experiments and Results): The SOTA and long-context claims rest on benchmark numbers that are referenced but not reproduced in the visible summary; the manuscript must include explicit tables (e.g., Table X comparing perplexity or accuracy against published baselines such as Llama-2 or Mistral) with error bars or multiple seeds to substantiate the central performance assertion, as single-run large-scale results can be sensitive to training variance.
- [§3.2] §3.2 (Architecture details): The specific interleaving ratio (Transformer vs. Mamba blocks) and selective MoE placement are presented as empirically chosen, yet the paper should quantify the sensitivity of throughput/memory/accuracy trade-offs to small changes in this ratio; without such analysis, it is unclear whether the reported gains are robust or tied to a narrow hyperparameter sweet spot.
minor comments (3)
- [Abstract] The abstract states 'state-of-the-art performance' without citing the exact benchmarks or scores; add a parenthetical reference to the main results table.
- [§3] Notation for layer types (e.g., 'T' for Transformer, 'M' for Mamba) should be defined at first use in §3 and used consistently in figures.
- [§4.3] The long-context section would benefit from a brief description of the evaluation protocol (e.g., needle-in-haystack or perplexity on extended sequences) to allow readers to assess the 256K claim.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation for minor revision. We appreciate the feedback on strengthening the empirical presentation and address each major comment below.
read point-by-point responses
-
Referee: [§4] §4 (Experiments and Results): The SOTA and long-context claims rest on benchmark numbers that are referenced but not reproduced in the visible summary; the manuscript must include explicit tables (e.g., Table X comparing perplexity or accuracy against published baselines such as Llama-2 or Mistral) with error bars or multiple seeds to substantiate the central performance assertion, as single-run large-scale results can be sensitive to training variance.
Authors: We agree that explicit tables are necessary to clearly substantiate the performance claims. In the revised manuscript we will add a dedicated results table (Table 1) that directly reports perplexity and accuracy on standard LM benchmarks against published baselines including Llama-2 and Mistral, as well as a separate table for long-context evaluations up to 256K tokens. Regarding error bars and multiple seeds, our primary training runs are single executions given the prohibitive cost of large-scale LLM training; however, we will report variance from the ablation checkpoints (which include multiple configurations) and explicitly note the single-run nature of the main results. We will also highlight the planned public release of weights and ablation checkpoints to enable community reproduction. revision: partial
-
Referee: [§3.2] §3.2 (Architecture details): The specific interleaving ratio (Transformer vs. Mamba blocks) and selective MoE placement are presented as empirically chosen, yet the paper should quantify the sensitivity of throughput/memory/accuracy trade-offs to small changes in this ratio; without such analysis, it is unclear whether the reported gains are robust or tied to a narrow hyperparameter sweet spot.
Authors: We thank the referee for this suggestion. The current manuscript already presents ablation results on layer ordering and expert mixing. To directly quantify sensitivity, we will add a new paragraph and accompanying figure in the revised Section 3.2 (and/or appendix) that reports throughput, memory footprint, and accuracy for small perturbations around the chosen interleaving ratio (e.g., 1:1, 2:1, and 3:1 Transformer-to-Mamba block ratios) and nearby MoE placement patterns. This analysis will demonstrate that performance remains stable within a reasonable range and that the selected configuration is not an isolated sweet spot. revision: yes
Circularity Check
No significant circularity in empirical architecture study
full rationale
The paper reports an empirical construction and large-scale training of a hybrid Transformer-Mamba-MoE model, with architectural choices (layer interleaving, selective MoE) validated via ablations and benchmark numbers rather than any derivation chain. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters or self-citations; performance results are presented as measured outcomes against external baselines, making the central claims self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcingalexander_duality_circle_linking unclearthe model presents strong results for up to 256K tokens context length
Forward citations
Cited by 25 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences
FRACTAL integrates fractional recurrent architecture into SSMs using a tunable singularity index to capture multi-scale temporal features, reporting 87.11% average on Long Range Arena and outperforming S5.
-
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
-
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
-
Component-Aware Self-Speculative Decoding in Hybrid Language Models
Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.
-
Priming: Hybrid State Space Models From Pre-trained Transformers
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
-
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
-
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
-
The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
-
Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
Rhamba uses region-aware masking strategies and hybrid Attention-Mamba models pretrained on ABIDE fMRI data to achieve top AUROC on schizophrenia and ADHD classification tasks while outperforming prior methods.
-
Lost in State Space: Probing Frozen Mamba Representations
Frozen Mamba patch-boundary readouts do not outperform mean pooling for sentence representations on SST-2, CoLA, MRPC, STS-B, and IMDb due to anisotropy (cosine similarity ~0.9999) and representational collapse (MCC=0...
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
-
Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
-
Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks
State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.
-
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
Rhamba is a region-aware hybrid Attention-Mamba framework that uses anatomically guided masking for self-supervised pretraining on ABIDE fMRI data and shows competitive AUROC on downstream schizophrenia and ADHD class...
-
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.
-
LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures
LayerTracer defines task particles as the first layer where target token probability rises sharply and vulnerable layers via maximum JS divergence after masking, showing task particles in deep layers and greater robus...
-
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
Reference graph
Works this paper leans on
-
[1]
The hidden attention of mamba models
Ameen Ali, Itamar Zimerman, and Lior Wolf. The hidden attention of mamba models. arXiv preprint arXiv:2403.01590, 2024
-
[2]
L-Eval: Instituting standardized evaluation for long context language models
Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023
-
[3]
PIQA: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020
work page 2020
-
[4]
Efficient intent detection with dual sentence encoders
Iñigo Casanueva, Tadas Temˇcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli´c. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45, 2020
work page 2020
-
[5]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 13
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
QuAC: Question answering in context
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, 2018
work page 2018
-
[7]
Palm: Scaling language modeling with pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1– 113, 2023
work page 2023
-
[8]
Unified scaling laws for routed language models
Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pages 4057–4086. PMLR, 2022
work page 2022
-
[9]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...
work page 2019
-
[10]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[11]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[12]
Hugging Face. Open LLM leaderboard. https://huggingface.co/spaces/ HuggingFaceH4/open_llm_leaderboard, 2024
work page 2024
-
[13]
Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, and Mark J. F. Gales. Multi-head state space model for speech recognition. In Proceedings of INTERSPEECH 2023, pages 241–245, 2023
work page 2023
-
[14]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022
work page 2022
-
[15]
Hungry hungry hippos: Towards language modeling with state space models
Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2022
work page 2022
-
[16]
A new algorithm for data compression
Philip Gage. A new algorithm for data compression. The C Users Journal, 12(2):23–38, 1994
work page 1994
-
[17]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Efficiently modeling long sequences with structured state spaces
Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021
work page 2021
-
[19]
Combining recurrent, convolutional, and continuous-time models with linear state space layers
Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021
work page 2021
-
[20]
Transformer language models without positional encodings still learn positional information
Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1382–1390, 2022
work page 2022
-
[21]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020. 14
work page 2020
-
[22]
CUAD: An expert-annotated NLP dataset for legal contract review
Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. CUAD: An expert-annotated NLP dataset for legal contract review. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021
work page 2021
-
[23]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Needle in a haystack - pressure testing llms
Greg Kamradt. Needle in a haystack - pressure testing llms. https://github.com/ gkamradt/LLMTest_NeedleInAHaystack/, 2023
work page 2023
-
[26]
The NarrativeQA reading comprehension challenge
Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Ga- bor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018
work page 2018
-
[27]
Natural questions: a benchmark for question answering research
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019
work page 2019
-
[28]
An evaluation dataset for intent classification and out-of-scope prediction
Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Intern...
work page 2019
-
[29]
Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002
work page 2002
-
[30]
TruthfulQA: Measuring how models mimic human falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics
work page 2022
-
[31]
Benchmarking natural lan- guage understanding services for building conversational agents
Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. Benchmarking natural lan- guage understanding services for building conversational agents. In Increasing Naturalness and Flexibility in Spoken Dialogue Interaction: 10th International Workshop on Spoken Dialogue Systems, pages 165–183. Springer, 2021
work page 2021
-
[32]
Learning word vectors for sentiment analysis
Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011
work page 2011
-
[33]
Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP,
Sabrina J Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y Lee, Benoît Sagot, et al. Between words and charac- ters: A brief history of open-vocabulary modeling and tokenization in NLP. arXiv preprint arXiv:2112.10508, 2021
-
[34]
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022
work page 2022
-
[35]
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022. 15
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
Can Mamba Learn How to Learn? A Comparative Study on in Context Learning Tasks,
Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can mamba learn how to learn? a comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024
-
[37]
Jonathan Pilault, Mahan Fathi, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, and Ross Goroshin. Block-state transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023
work page 2023
-
[38]
MoE-Mamba: Efficient selective state space models with mixture of experts
Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, and Sebastian Jaszczur. MoE-Mamba: Efficient selective state space models with mixture of experts. arXiv preprint arXiv:2401.04081, 2024
-
[39]
Hyena hierarchy: Towards larger convolutional language models
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning , pages 28043–28078. PMLR, 2023
work page 2023
-
[40]
StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models
Michael Poli, Jue Wang, Stefano Massaroli, Jeffrey Quesnelle, Ryan Carlow, Eric Nguyen, and Armin Thomas. StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models. https://github.com/togethercomputer/stripedhyena, 2023
work page 2023
-
[41]
Parallel context windows for large language models
Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. Parallel context windows for large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6383–6402, 2023
work page 2023
-
[42]
WinoGrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020
work page 2020
-
[43]
Diagonal state space augmented transformers for speech recognition
George Saon, Ankit Gupta, and Xiaodong Cui. Diagonal state space augmented transformers for speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023
work page 2023
-
[44]
Neural machine translation of rare words with subword units
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016
work page 2016
-
[45]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[46]
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017
work page 2017
-
[47]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
work page 2024
-
[48]
Challenging BIG- Bench tasks and whether chain-of-thought can solve them
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging BIG- Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023
work page 2023
-
[49]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 16
work page 2017
-
[52]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019
work page 2019
-
[53]
Root mean square layer normalization
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[54]
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[55]
Efficient long sequence modeling via state space augmented transformer
Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, and Jianfeng Gao. Efficient long sequence modeling via state space augmented transformer. arXiv preprint arXiv:2212.08136, 2022. 17
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.