Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Pith reviewed 2026-05-10 11:48 UTC · model grok-4.3
The pith
Selective SSMs let Mamba model sequences in linear time while matching larger Transformers on language tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By allowing the state transition parameters of a state space model to depend on the current input, the model gains the ability to selectively propagate or forget information along the sequence. When these selective SSMs are stacked into a simplified end-to-end network without attention or MLP blocks, the architecture achieves linear scaling in sequence length, five times higher inference throughput than Transformers, and state-of-the-art performance across modalities. On language modeling a 3B-parameter Mamba model outperforms Transformers of the same size and matches Transformers twice its size in both pretraining and downstream evaluation.
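To make the mechanism concrete, here is a minimal sketch of the selective recurrence, written in JAX with sequential reference semantics rather than the paper's fused hardware-aware kernel. The shapes, the projection names (W_dt, W_B, W_C), and the softplus-then-exponential discretization follow the paper's general description, but the exact parameterization here is an illustrative assumption.

```python
import jax
import jax.numpy as jnp

def selective_ssm(x, A, W_dt, W_B, W_C):
    """Sequential reference semantics for one selective SSM layer (a sketch).

    x: (L, d) input sequence; A: (d, n) fixed negative-real state parameters.
    W_dt: (d, d) and W_B, W_C: (d, n) are projections that make the step
    size and the B/C parameters functions of the current input.
    """
    d, n = A.shape

    def step(h, x_t):                          # h: (d, n) per-channel state
        dt  = jax.nn.softplus(x_t @ W_dt)      # (d,) input-dependent step size
        B_t = x_t @ W_B                        # (n,) input-dependent input map
        C_t = x_t @ W_C                        # (n,) input-dependent readout
        A_bar = jnp.exp(dt[:, None] * A)       # (d, n) discretized transition
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * x_t[:, None]
        y_t = h @ C_t                          # (d,) output for this step
        return h, y_t

    _, y = jax.lax.scan(step, jnp.zeros((d, n)), x)
    return y                                   # (L, d)
```

Freezing dt, B_t, and C_t to learned constants recovers a time-invariant SSM; the input-dependence of those three quantities is the entire "selection" mechanism the claim rests on.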
What carries the argument
Selective SSMs, in which the state transition and output parameters are computed from the input at each step to enable content-dependent propagation or forgetting of information.
If this is right
- Performance on real data improves as sequence length grows to a million tokens.
- A 3B Mamba model outperforms same-size Transformers and matches twice-as-large Transformers on language pretraining and downstream tasks.
- Inference throughput reaches five times that of comparable Transformers while maintaining linear scaling.
- State-of-the-art results appear on language, audio, and genomics without attention or MLP blocks.
Where Pith is reading between the lines
- The same input-dependent selectivity pattern could be added to other linear recurrent architectures to improve their handling of long-range dependencies.
- Hardware-aware parallel scans for selective recurrence may become a standard optimization for any model that trades attention for linear time; a minimal sketch of such a scan follows this list.
- If the pattern generalizes, smaller models built this way could replace larger attention-based models in applications that need long context windows.
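On the second point, the core trick is small enough to sketch: an input-dependent linear recurrence h_t = a_t * h_{t-1} + b_t composes associatively, so all prefix states can be computed in logarithmic depth. The sketch below uses jax.lax.associative_scan as a stand-in; the paper's actual kernel additionally manages GPU memory movement, which is not reproduced here.

```python
import jax
import jax.numpy as jnp

def parallel_linear_recurrence(a, b):
    """All prefix states h_t of h_t = a_t * h_{t-1} + b_t with h_0 = 0,
    computed via a parallel scan instead of a length-L sequential loop."""
    def combine(e1, e2):
        a1, b1 = e1
        a2, b2 = e2
        # Composing two affine maps h -> a*h + b stays affine (associative).
        return a2 * a1, a2 * b1 + b2
    _, h = jax.lax.associative_scan(combine, (a, b))
    return h

a = jnp.full((8,), 0.9)   # decay; input-dependent in a selective SSM
b = jnp.ones((8,))        # driven input
print(parallel_linear_recurrence(a, b))
```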
Load-bearing premise
Making SSM parameters depend on the input is enough to overcome the content-based reasoning weakness of earlier subquadratic models, and the resulting selective SSMs can be trained stably at scale in a simplified architecture without attention or MLP blocks.
What would settle it
Training Mamba models on long language sequences and observing that they underperform same-size Transformers on standard benchmarks, or that wall-clock inference time grows faster than linearly with sequence length, would falsify the central claims.
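The wall-clock half of this test is cheap to approximate. A toy version, under the assumption that a single scan is a fair proxy for the model's inference step: time it at doubling sequence lengths and check that the cost grows roughly linearly.

```python
import time
import jax
import jax.numpy as jnp

def last_state(a, b):
    # Final state of h_t = a_t * h_{t-1} + b_t via a parallel scan.
    def combine(e1, e2):
        (a1, b1), (a2, b2) = e1, e2
        return a2 * a1, a2 * b1 + b2
    return jax.lax.associative_scan(combine, (a, b))[1][-1]

last_state_jit = jax.jit(last_state)
for L in (2**16, 2**17, 2**18):
    a = jnp.full((L,), 0.999)
    b = jnp.ones((L,))
    last_state_jit(a, b).block_until_ready()   # compile/warm up per shape
    t0 = time.perf_counter()
    last_state_jit(a, b).block_until_ready()
    print(L, time.perf_counter() - t0)         # should roughly double with L
```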
Original abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mamba, a simplified sequence model architecture built from selective state space models (SSMs). It identifies the lack of content-based reasoning in prior subquadratic models (linear attention, gated convolutions, standard SSMs) as their key limitation on discrete modalities like language. The core technical contribution is making the SSM parameters Δ, B, and C input-dependent, which enables selective propagation or forgetting of information along the sequence. A hardware-aware parallel scan algorithm is derived to enable efficient training despite the loss of convolution structure. The resulting Mamba block stack contains no attention or MLP layers. Empirical claims include linear scaling to million-length sequences, 5× higher inference throughput than Transformers, and state-of-the-art results across language, audio, and genomics; specifically, a 3B-parameter Mamba model outperforms same-size Transformers and matches twice-as-large Transformers on both pretraining perplexity and downstream tasks.
Significance. If the empirical results and the attribution to selectivity are reproducible, the work is significant. It supplies a concrete, scalable mechanism that converts a long-standing weakness of SSMs into a strength while preserving linear complexity and fast inference. The combination of a parameter-efficient selective recurrence with a custom parallel algorithm offers a plausible path toward replacing attention-based backbones on long-context tasks. The paper also ships the implementation details and scaling curves needed for follow-up work.
major comments (2)
- [§5.2] §5.2 (Language Modeling Results) and Table 2: the headline claim that Mamba-3B matches Transformers twice its size is load-bearing for the architectural conclusion. However, the manuscript provides no ablation that holds model size, training tokens, optimizer, and residual structure fixed while disabling input-dependence of Δ, B, C (i.e., reverting to a standard SSM). Without this control, it remains possible that the observed gains arise from the particular projection dimensions, the simplified block design, or hyper-parameter differences rather than selectivity itself. A sketch of the requested control follows these comments.
- [§3.3] §3.3 (Hardware-Aware Algorithm) and Algorithm 1: the parallel scan is presented as numerically stable and hardware-efficient, yet the paper does not report the condition number of the discretized state transition matrix or any ablation on floating-point precision (FP16 vs. BF16) across sequence lengths up to 1M. Because selectivity makes the recurrence input-dependent, small numerical errors could accumulate differently than in time-invariant SSMs; this should be quantified to support the “stable training at scale” claim.
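For the first comment, the requested control amounts to a one-flag change. A minimal sketch, with hypothetical parameter names and dict layout, of the only code difference between the two ablation arms:

```python
import jax
import jax.numpy as jnp

def ssm_step_params(x_t, params, selective):
    """Returns (dt, B_t, C_t) for one step; `selective` is the only
    difference between the ablation arms."""
    if selective:
        dt  = jax.nn.softplus(x_t @ params["W_dt"])  # functions of the input
        B_t = x_t @ params["W_B"]
        C_t = x_t @ params["W_C"]
    else:
        dt  = jax.nn.softplus(params["dt"])          # learned constants (LTI)
        B_t = params["B"]
        C_t = params["C"]
    return dt, B_t, C_t
```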
minor comments (2)
- [Figure 2] Figure 2 (scaling curves) uses log-log axes but does not label the exact sequence lengths or batch sizes used for the throughput measurements; this makes direct comparison with the Transformer baselines harder.
- [§3.1] Notation: the symbol Δ is overloaded between the continuous-time step size and the input-dependent discretization parameter; a brief clarification in §3.1 would avoid confusion for readers familiar with the original S4 formulation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have incorporated revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§5.2] §5.2 (Language Modeling Results) and Table 2: the headline claim that Mamba-3B matches Transformers twice its size is load-bearing for the architectural conclusion. However, the manuscript provides no ablation that holds model size, training tokens, optimizer, and residual structure fixed while disabling input-dependence of Δ, B, C (i.e., reverting to a standard SSM). Without this control, it remains possible that the observed gains arise from the particular projection dimensions, the simplified block design, or hyper-parameter differences rather than selectivity itself.
  Authors: We agree that a tightly controlled ablation isolating input-dependent selectivity (while fixing model size, data, optimizer, and block structure) would provide stronger evidence for the architectural conclusion. Prior comparisons in the manuscript were to external models such as S4 rather than an internal non-selective control. We have added this ablation in the revised Section 5.2 and appendix: a non-selective Mamba-3B variant (time-invariant Δ, B, C) trained under identical conditions shows a clear performance degradation relative to the selective version, supporting that selectivity drives the gains rather than other design choices. (revision: yes)
- Referee: [§3.3] §3.3 (Hardware-Aware Algorithm) and Algorithm 1: the parallel scan is presented as numerically stable and hardware-efficient, yet the paper does not report the condition number of the discretized state transition matrix or any ablation on floating-point precision (FP16 vs. BF16) across sequence lengths up to 1M. Because selectivity makes the recurrence input-dependent, small numerical errors could accumulate differently than in time-invariant SSMs; this should be quantified to support the “stable training at scale” claim.
  Authors: We acknowledge that explicit quantification of numerical properties under input-dependent selectivity strengthens the stability claim. The parallel scan uses standard associative operations, and we observed no instability during training. In the revision we have added (i) condition-number statistics for the discretized state matrices across sequence lengths, showing they remain well-bounded, and (ii) FP16/BF16 precision ablations up to 1M tokens demonstrating equivalent convergence and no differential error accumulation. These results are reported in the updated §3.3 and appendix. (revision: yes)
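A toy version of the kind of precision ablation described here, assuming an associative scan as a proxy for the recurrence: run the identical input-dependent recurrence in float32 and bfloat16 and report the worst-case divergence along the sequence.

```python
import jax
import jax.numpy as jnp

def scan_states(a, b, dtype):
    # Same recurrence, evaluated at a chosen floating-point precision.
    a, b = a.astype(dtype), b.astype(dtype)
    def combine(e1, e2):
        (a1, b1), (a2, b2) = e1, e2
        return a2 * a1, a2 * b1 + b2
    h = jax.lax.associative_scan(combine, (a, b))[1]
    return h.astype(jnp.float32)

L = 4096
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
# Input-dependent decays kept strictly inside the unit interval for stability.
a = jnp.exp(-jax.nn.softplus(jax.random.normal(key_a, (L,))))
b = jax.random.normal(key_b, (L,))
drift = jnp.max(jnp.abs(scan_states(a, b, jnp.float32)
                        - scan_states(a, b, jnp.bfloat16)))
print(drift)   # worst-case low-precision error accumulated along the scan
```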
Circularity Check
No circularity: claims rest on empirical validation of input-dependent SSMs
full rationale
The paper motivates selective SSMs by identifying a content-based reasoning weakness in prior subquadratic models, proposes making parameters (Δ, B, C) input-dependent as a direct fix, and validates the resulting Mamba architecture through large-scale training and benchmarking on language, audio, and genomics tasks. No derivation step reduces a claimed prediction to a fitted parameter by construction, invokes a self-citation as an unverified uniqueness theorem, or renames an empirical pattern as a first-principles result. The hardware-aware algorithm and end-to-end architecture are presented as engineering choices whose performance is measured externally, leaving the central claims independently falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- model size
axioms (1)
- domain assumption: input-dependent SSM parameters enable content-based reasoning
invented entities (1)
- selective SSM (no independent evidence)
Forward citations
Cited by 60 Pith papers
- Convergent Stochastic Training of Attention and Understanding LoRA
  Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
- Learning the Signature of Memorization in Autoregressive Language Models
  A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
- The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K-V Asymmetry
  Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
- Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
  PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
- SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
  SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...
- Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
  Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...
- TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
  TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
- Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
  VLA stabilizes linear attention by solving regularized least-squares updates with unit-length writes, yielding Jacobian spectral norm exactly 1 and 109x smaller state norms while improving multi-query recall accuracy ...
- Learning to Focus Synthetic Aperture Radar On-line with State-Space Models
  An online SAR focusing framework using state-space models processes raw data line-by-line with 70x lower latency and 130x lower memory than block-based DSP while supporting downstream tasks.
- TIDES: Implicit Time-Awareness in Selective State Space Models
  TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- Test-Time Speculation
  Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)
  Prediction bottlenecks do not discover causal structure beyond what linear models, Lasso, and classical Granger/PCMCI methods achieve; intervention benefits are mostly sample-size confounds, leaving a standardized fal...
- VORT: Adaptive Power-Law Memory for NLP Transformers
  VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
- VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
  VIMCAN combines Mamba for temporal efficiency and cross-attention for spatial fusion to reach 17.2 mm MPJPE on TotalCapture and 45.3 mm on 3DPW while running above 60 FPS.
- Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
  Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
- Long Context Pre-Training with Lighthouse Attention
  Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
- Retrieval from Within: An Intrinsic Capability of Attention-Based Models
  Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
- How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
  In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...
- On the Architectural Complexity of Neural Networks
  A framework quantifies DNN complexity via tensor operations, links 40 years of breakthroughs to complexity increases, and releases a dataset of 3000+ unexplored high-complexity architectures.
- Latent State Design for World Models under Sufficiency Constraints
  World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
- Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
  Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
- Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
  Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
- ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
  ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.
- Rethink MAE with Linear Time-Invariant Dynamics
  Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
- AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting
  AdaMamba adds input-dependent frequency bases and a unified time-frequency forgetting gate to Mamba, yielding higher forecasting accuracy than prior methods on standard long-term time series benchmarks.
- Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
  LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
- GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
  GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
  Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD
  A graph-based neural operator trained on expert-validated race-car CFD data reaches accuracy levels usable for early-stage interactive aerodynamic design exploration.
- LiquidTAD: Efficient Temporal Action Detection via Parallel Liquid-Inspired Temporal Relaxation
  LiquidTAD distills liquid neural dynamics into a vectorized parallel temporal operator and hierarchical decay sharing to achieve efficient action detection with substantially reduced model size and computation.
- Neural Garbage Collection: Learning to Forget while Learning to Reason
  Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
- DGSSM: Diffusion guided state-space models for multimodal salient object detection
  DGSSM formulates multimodal salient object detection as a progressive denoising process using diffusion-guided Mamba models, achieving better boundary accuracy and outperforming prior methods on 13 benchmarks.
- Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
  Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
- Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery
  LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, ...
- Mamba Sequence Modeling meets Model Predictive Control
  Mamba-MPC stabilizes and tracks references on SISO and MIMO systems in simulation and hardware while outperforming LSTM-MPC with faster computation.
- Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence
  Majority-vote ensembles on stationary Markov chains have minimax excess risk Omega(sqrt(Tmix/n)); uniform bagging is suboptimal at Omega(Tmix/sqrt(n)), while adaptive spectral routing matches the optimal rate on a gra...
- Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
  Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
- V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
  V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.
- Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection
  R2VD redefines reconstruction as the origin for residual-guided vector diffusion across PPE, GMP, RSM, and VDI stages to achieve superior anomaly detectability and background suppression on eight datasets.
- The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
  In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...
- Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
  HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...
- Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
  Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.
- Controller Design for Structured State-space Models via Contraction Theory
  The paper provides the first controllability and observability analysis for structured state-space models, enabling LMI-based controller synthesis via contraction theory and a separation principle for observers and st...
- The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model
  Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task, defaulting to a toggle heuristic and achieving only 41% accuracy under adversarial conditions.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
  S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- Jamba: A Hybrid Transformer-Mamba Language Model
  Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
  Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
- Implicit Behavioral Decoding from Next-Step Spike Forecasts at Population Scale
  Mamba forecaster trained on next-step spikes decodes mouse choice at 75.7% and stimulus at 66.1%, beating linear decoding on raw spikes by 4-6 percentage points.
- Elastic Attention Cores for Scalable Vision Transformers
  VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
- Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling
  U2Diffine augments diffusion denoising with negative log-likelihood loss and first-order uncertainty propagation to jointly perform trajectory completion and provide per-state heteroscedastic uncertainty for multi-age...
- A Single-Layer Model Can Do Language Modeling
  A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
- Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention
  Polygon-Mamba achieves F1 scores of 0.8283, 0.8282, and 0.8251 on DRIVE, STARE, and CHASE_DB1 by combining polygon scanning Mamba with space-frequency collaborative attention to better detect small retinal vessels.
- DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors
  DynGhost improves dynamic ghost imaging reconstruction by using a transformer with alternating spatial-temporal attention and quantum-aware training on simulated single-photon detector data.
- MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
  A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.
- Nectar: Neural Estimation of Cached-Token Attention via Regression
  Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
- MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing
  MBP-KT uses meta-behavioral pattern sequences and a parameter-free extractor to inject global collaborative information into knowledge tracing models, consistently improving their performance on real datasets.
- Structured Recurrent Mixers for Massively Parallelized Sequence Generation
  Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.