Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Canonical reference. 74% of citing Pith papers cite this work as background.
abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
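As a rough illustration of the selection mechanism the abstract describes, the sketch below runs a single-channel selective SSM recurrence in plain Python. It is not the paper's implementation: the function name `selective_scan`, the weights `w_B`, `w_C`, `w_delta`, `b_delta`, and the scalar-input setup are all hypothetical, and it assumes a diagonal state matrix where the state decays by `exp(delta * A)` and the input enters scaled by `delta`. The loop is sequential and does not attempt the hardware-aware parallel algorithm.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size delta positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, A, w_B, w_C, w_delta, b_delta):
    """Sequential single-channel selective SSM (illustrative assumption).

    xs: list of scalar inputs, one per time step.
    A: diagonal state matrix as a list of negative floats (length N).
    w_B, w_C: length-N weights, so B_t = w_B * x_t and C_t = w_C * x_t.
    w_delta, b_delta: scalars, so delta_t = softplus(w_delta * x_t + b_delta).
    Returns (outputs, final_state).
    """
    n = len(A)
    h = [0.0] * n
    ys = []
    for x in xs:
        # The step size depends on the current token: a large delta
        # forgets more of the old state and weights the new input more.
        delta = softplus(w_delta * x + b_delta)
        for i in range(n):
            a_bar = math.exp(delta * A[i])  # decay factor in (0, 1) when A[i] < 0
            b_bar = delta * w_B[i] * x      # input-dependent B_t, scaled by delta
            h[i] = a_bar * h[i] + b_bar * x
        # Input-dependent readout C_t applied to the updated state.
        ys.append(sum(w_C[i] * x * h[i] for i in range(n)))
    return ys, h
```

Because `delta`, `B_t`, and `C_t` are recomputed from each input, the recurrence is no longer time-invariant, which is why, per the abstract, it cannot be evaluated as a fixed convolution. In this toy setup a zero-valued token writes nothing to the state and only decays it.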
citing papers explorer
- Preisach Attention: A Hysteretic Model of Sequential Memory
  PAL uses the classical Preisach hysteresis operator with learned thresholds and an extrema stack to model sequences, proving O(1)-depth Turing completeness via two-stack PDA simulation and incomparability with standard transformers on rate-independent vs. random-access functions.
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
- Convergent Stochastic Training of Attention and Understanding LoRA
  Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
- Learning the Signature of Memorization in Autoregressive Language Models
  A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
- The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry
  Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
- When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
  Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
- Sparse Attention as Compact Kernel Regression
  Sparse attention arises from compact kernel regression, with Epanechnikov and similar kernels mapping to normalized ReLU, sparsemax, and alpha-entmax attention.
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States
  TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
- VMamba: Visual State Space Model
  VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
- RESOLVE: A Multi-Resolution and Multi-Modal Dataset for Roadside Cooperative Perception
  RESOLVE provides a controlled multi-resolution LiDAR and camera benchmark for evaluating 3D detection and tracking under point sparsity variations in roadside cooperative perception.
- Morphing into Hybrid Attention Models
  FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.
- Interventional Flow Matching: Prospective Dose-Response Forecasting with Velocity-Field Jacobian Regularization
  IFM conditions flow-matching velocity fields on patient history and planned treatments, using velocity-field Jacobian regularization to enforce signed, dose-bounded insulin-lowering and carbohydrate-raising effects on glucose in simulated UVA/Padova type 1 diabetes data.
- CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention
  CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.
- MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba
  MIDS uses bidirectional Mamba for CAN bus masquerade and tampering detection, achieving F1 of 96.94% on a new Tesla dataset and 93.70-99.61% on public benchmarks while outperforming baselines.
- Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark
  Ego-METAS is a new benchmark providing unified egocentric video data, splits, features and baselines for online multimodal temporal action segmentation under hardware-representative energy constraints.
- AdaState: Self-Evolving Anchors for Streaming Video Generation
  AdaState replaces the static first-frame KV anchor with an evolving hidden latent that the model denoises alongside content, treating time as relative to enable recurrence and richer dynamics in streaming video generation.
- From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks
  Introduces evolutionary traffic datasets and a yearly streaming protocol, finding that many SOTA methods fail when sensor networks grow over decades.
- CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models
  CaMBRAIN introduces a causal Mamba-based SSM with a multi-stage self-supervised training pipeline that achieves SOTA results on three EEG datasets while enabling linear-time long-range inference.
- Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference
  Meta-Attention introduces per-token Bayesian routing among attention mechanisms via amortised variational inference with a Dirichlet prior, yielding lower projected FLOP cost than prior-free routing on a Tiny LM benchmark.
- Prompting Is All You Need: Multi-view Prompting Large Language Models for Aspect-Based Sentiment Analysis
  LLM-MvP adapts multi-view prompting to LLMs via constrained decoding to close the gap between few-shot prompting and fine-tuned models on five ABSA benchmarks while cutting inference overhead.
- From Item-Only to Query-Item: Query-Conditioned Generative Search with QGS in Quark
  QGS introduces query-item pair encoding and query-conditioned prediction with a linear HSTU encoder and HFG-Attention to reduce noise from query switches in generative search ranking, reporting online gains in a commercial system.
- UWM-JEPA: Predictive World Models That Imagine in Belief Space
  UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
- SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation
  SliceWorld introduces a world-state model for CT report generation that uses predictive and factor-aware objectives on axial slice sequences.
- Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference
  AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.
- Tensor Cache: Eviction-conditioned Associative Memory for Transformers
  Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
- Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
  LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
- KVBuffer: IO-aware Serving for Linear Attention
  KVBuffer reduces linear attention decoding latency by up to 45% and increases speculative decoding throughput 5x by buffering keys/values for flexible chunked and parallel computation.
- Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation
  Patch-MoE Mamba introduces patch-ordered hierarchical scanning and an MoE-based directional fusion module to improve Mamba segmentation models on polyp and skin lesion datasets.
- Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
  Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
- Identify Then Project: Contrastive Learning of Latent Dynamics from Partial Observations with Port-Hamiltonian Structure
  A two-stage contrastive teacher-student framework learns and then projects latent dynamics onto port-Hamiltonian submanifolds from partial observations.
- Public-Decay Homomorphic State Space Models for Private Sequence Inference
  Public-decay HSSMs achieve exact plaintext-matching accuracy (0.7505/0.7420) on Rotten Tomatoes and SST-2 while running ~5x faster than polynomial attention under FHE constraints.
- Transformer-like Inference from Optimal Control
  Derives transformer-like dual-filter inference layers from first-principles optimal control on nonlinear discrete and linear Gaussian sequence models.
- Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models
  Social-Mamba introduces a Cycle Mamba block and social triplet factorization to achieve state-of-the-art trajectory forecasting accuracy with linear-time social interaction modeling on five benchmarks.
- DSSP: Diffusion State Space Policy with Full-History Encoding
  DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
- VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
  VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
- A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures
  A real Schur decomposition projection maps the state matrix of discrete-time state-space layers onto its nearest stable counterpart, delivering accuracy comparable to prior stable identification methods with fewer weights.
- QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
  QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
- Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
  PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
- Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction
  Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
- SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
  SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
- Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
  Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on nuScenes with up to 34% MAE reduction.
- TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
  TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
- Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
  VLA stabilizes linear attention by solving regularized least-squares updates with unit-length writes, yielding Jacobian spectral norm exactly 1 and 109x smaller state norms while improving multi-query recall accuracy over standard linear attention and DeltaNet.
- Learning to Focus Synthetic Aperture Radar On-line with State-Space Models
  An online SAR focusing framework using state-space models processes raw data line-by-line with 70x lower latency and 130x lower memory than block-based DSP while supporting downstream tasks.
- TIDES: Implicit Time-Awareness in Selective State Space Models
  TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)
  Prediction bottlenecks do not discover causal structure beyond what linear models, Lasso, and classical Granger/PCMCI methods achieve; intervention benefits are mostly sample-size confounds, leaving a standardized falsification benchmark as the main contribution.
- VORT: Adaptive Power-Law Memory for NLP Transformers
  VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
- VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
  VIMCAN combines Mamba for temporal efficiency and cross-attention for spatial fusion to reach 17.2 mm MPJPE on TotalCapture and 45.3 mm on 3DPW while running above 60 FPS.