Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
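The selection mechanism the abstract describes — letting the SSM parameters Δ, B, and C be functions of the current token — can be illustrated with a minimal NumPy sketch. The shapes, the softplus parameterization of Δ, and the simplified discretization below are illustrative assumptions for exposition, not the paper's exact formulation or its hardware-aware kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm_scan(x, A, W_delta, W_B, W_C):
    """Naive sequential selective-SSM scan (illustrative sketch).

    x: (L, d) input sequence.  A: (d, n) diagonal state matrix (negative
    entries for stability).  W_delta: (d, d); W_B, W_C: (n, d) projections
    that make the step size Delta and the B/C matrices token-dependent --
    the 'selection' that lets the model propagate or forget per token.
    """
    L, d = x.shape
    n = A.shape[1]
    h = np.zeros((d, n))          # one n-dim hidden state per channel
    y = np.zeros((L, d))
    for t in range(L):
        delta = softplus(x[t] @ W_delta)[:, None]   # (d, 1) input-dependent step
        B = W_B @ x[t]                              # (n,) token-dependent input map
        C = W_C @ x[t]                              # (n,) token-dependent readout
        Abar = np.exp(delta * A)                    # discretized decay, in (0, 1)
        h = Abar * h + (delta * B[None, :]) * x[t][:, None]
        y[t] = h @ C                                # per-channel readout
    return y

L, d, n = 6, 4, 3
x = rng.standard_normal((L, d))
A = -np.exp(rng.standard_normal((d, n)))   # negative poles for stability
y = selective_ssm_scan(x, A,
                       rng.standard_normal((d, d)),
                       rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)))
print(y.shape)   # (6, 4)
```

The sequential loop is only for clarity: because the input-dependent parameters rule out the convolutional mode of earlier SSMs, the paper's contribution is computing exactly this recurrence with a hardware-aware parallel scan that avoids materializing the expanded state in slow memory.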
fields
cs.LG 63 · cs.CV 52 · cs.CL 24 · cs.AI 8 · eess.AS 3 · eess.IV 3 · eess.SP 3 · eess.SY 3 · q-bio.NC 3 · cs.CR 2

roles
background 5

representative citing papers
- Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
- A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
- Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
- RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
- QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
- PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
- Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
- SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
- Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on nuScenes with up to 34% MAE reduction.
- TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
- VLA stabilizes linear attention by solving regularized least-squares updates with unit-length writes, yielding Jacobian spectral norm exactly 1 and 109x smaller state norms while improving multi-query recall accuracy over standard linear attention and DeltaNet.
- An online SAR focusing framework using state-space models processes raw data line-by-line with 70x lower latency and 130x lower memory than block-based DSP while supporting downstream tasks.
- TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.
- LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- Prediction bottlenecks do not discover causal structure beyond what linear models, Lasso, and classical Granger/PCMCI methods achieve; intervention benefits are mostly sample-size confounds, leaving a standardized falsification benchmark as the main contribution.
- VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
- VIMCAN combines Mamba for temporal efficiency and cross-attention for spatial fusion to reach 17.2 mm MPJPE on TotalCapture and 45.3 mm on 3DPW while running above 60 FPS.
- Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
- Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.
- In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.
- A framework quantifies DNN complexity via tensor operations, links 40 years of breakthroughs to complexity increases, and releases a dataset of 3000+ unexplored high-complexity architectures.
- World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
citing papers explorer
-
Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on nuScenes with up to 34% MAE reduction.
-
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
-
VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
VIMCAN combines Mamba for temporal efficiency and cross-attention for spatial fusion to reach 17.2 mm MPJPE on TotalCapture and 45.3 mm on 3DPW while running above 60 FPS.
-
Rethink MAE with Linear Time-Invariant Dynamics
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
-
V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.
-
Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection
R2VD redefines reconstruction as the origin for residual-guided vector diffusion across PPE, GMP, RSM, and VDI stages to achieve superior anomaly detectability and background suppression on eight datasets.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.
-
LiquidTAD: Efficient Temporal Action Detection via Parallel Liquid-Inspired Temporal Relaxation
LiquidTAD distills liquid neural dynamics into a vectorized parallel temporal operator and hierarchical decay sharing to achieve efficient action detection with substantially reduced model size and computation.
-
DGSSM: Diffusion guided state-space models for multimodal salient object detection
DGSSM formulates multimodal salient object detection as a progressive denoising process using diffusion-guided Mamba models, achieving better boundary accuracy and outperforming prior methods on 13 benchmarks.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
-
Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention
Polygon-Mamba achieves F1 scores of 0.8283, 0.8282, and 0.8251 on DRIVE, STARE, and CHASE_DB1 by combining polygon scanning Mamba with space-frequency collaborative attention to better detect small retinal vessels.
-
DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors
DynGhost improves dynamic ghost imaging reconstruction by using a transformer with alternating spatial-temporal attention and quantum-aware training on simulated single-photon detector data.
-
GEM: Generating LiDAR World Model via Deformable Mamba
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
-
Detecting AI-Generated Videos with Spiking Neural Networks
MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
-
A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series
A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.
-
GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution
GTF is an omnidirectional EPI Transformer for light field super-resolution that models horizontal, vertical, 45-degree and 135-degree epipolar geometries, reaching 32.78 dB on benchmarks and top ranks in the NTIRE 2026 challenge.
-
PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression
PACE achieves state-of-the-art LiDAR point cloud compression with over 90% lower decoding latency by using a non-causal backbone and a stage-scalable causal predictor.
-
SAMamba3D: adapting Segment Anything for generalizable 3D segmentation of multiphase pore-scale images
SAMamba3D adapts a frozen SAM encoder with Mamba volumetric context and cross-scale features to match or exceed 3D baselines on diverse sandstone and carbonate datasets while reducing case-specific retraining.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking
MambaTrack improves RGB-Event object tracking via event-adaptive state transitions in a Dynamic State Space Model and a Gated Projection Fusion module, reporting state-of-the-art results on FE108 and FELT datasets.
-
CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery
CloudMamba combines uncertainty-guided refinement with a dual-scale Mamba network to outperform prior methods on cloud segmentation accuracy while maintaining linear computational cost.
-
Physics-Aligned Spectral Mamba: Decoupling Semantics and Dynamics for Few-Shot Hyperspectral Target Detection
SpecMamba decouples stable semantic features from agile spectral adaptation via DCT-Mamba adapters, prior-guided tri-encoders, and self-supervised test-time mapping to improve few-shot hyperspectral target detection.
-
Edge-Efficient Image Restoration: Transformer Distillation into State-Space Models
Hybrid transformer-SSM networks found by multi-objective search run 1.17x to 3.4x faster on edge CPUs for image restoration tasks with competitive quality.
-
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.
-
TriTS: Time Series Forecasting from a Multimodal Perspective
TriTS projects time series into time, frequency, and vision modalities with Period-Aware Reshaping and MR-WM to achieve SOTA long-term forecasting at lower computational cost.
-
Hero-Mamba: Mamba-based Dual Domain Learning for Underwater Image Enhancement
Hero-Mamba combines parallel spatial-spectral Mamba processing and a background-light-guided ColorFusion block to enhance underwater images, reporting PSNR 25.802 and SSIM 0.913 on the LSUI benchmark.
-
CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder
CLIMB generates controllable longitudinal brain MRI images from baseline scans using a Mamba-based latent diffusion model and Gaussian-aligned autoencoder, reporting SSIM 0.9433 on the ADNI dataset of 6306 scans.
-
HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.
-
MSR: Hybrid Field Modeling for CT-MRI Rigid-Deformable Registration of the Cervical Spine with an Annotated Dataset
MSR fuses per-vertebra rigid alignments with a gated Mamba-Swin deformable module to produce a hybrid deformation field for CT-MRI cervical spine registration, released together with the public R-D-Reg annotated dataset.
-
Selective Attention-Based Network for Robust Infrared Small Target Detection
SANet augments U-Net with a Dual-path Semantic-aware Module using pinwheel convolutions and CBAM, plus a Selective Attention Fusion Module for adaptive cross-scale feature fusion, to improve detection of sub-pixel infrared targets.
-
A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification
A new CNN-Transformer hybrid with twin-branch 3D/2D convolution, hybrid pooling attention, cascade spectral transformers, and cross-layer fusion reports higher accuracy than prior methods on standard hyperspectral datasets.
-
Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation
DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.
-
NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results
The NTIRE 2026 Challenge released a public dataset of 2,000 videos with crowdsourced saliency maps and reported results from participating teams using standard quality metrics.
-
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
HyperSSM integrates hypergraphs and state space models to let correlated objects mutually refine motion estimates, stabilizing trajectories under noise and occlusion for state-of-the-art multi-object tracking.
-
Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection
A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.
-
HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining computational efficiency for real-time use.
-
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.
-
MambaKick: Early Penalty Direction Prediction from HAR Embeddings
MambaKick reuses pretrained HAR embeddings with Mamba temporal modeling to predict penalty kick direction, reaching 53.1% accuracy on three classes and 64.5% on two classes.
-
SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification
SSMamba uses a two-stage self-supervised pretraining and fine-tuning pipeline with Mamba-based components to outperform prior pathological foundation models on ROI and WSI classification tasks.
-
Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels
Dual-head training on hierarchical OA labels yields backbone-dependent gains in KL metrics, more ordered latent severity axes, and better saliency alignment with cartilage for some 3D backbones.
-
ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification
ConvVitMamba integrates multiscale convolution, transformer encoding, and Mamba-based refinement with PCA to outperform prior CNN, ViT, and Mamba methods in accuracy, size, and speed on four HSI benchmark datasets.
-
Beyond ZOH: Advanced Discretization Strategies for Vision Mamba
Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.
-
A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs
Hybrid EfficientNetV2-M and Vision Mamba architecture achieves strong binary classification performance on abnormality-centered mammography ROIs from CBIS-DDSM.
-
The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
The NTIRE 2026 challenge establishes a benchmark for x4 super-resolution of remote sensing infrared images, with 13 teams submitting valid methods evaluated on a dedicated dataset.
-
The Fourth Challenge on Image Super-Resolution (×4) at NTIRE 2026: Benchmark Results and Method Overview
The NTIRE 2026 ×4 super-resolution challenge benchmarks 31 teams on bicubic-downsampled images using PSNR for the restoration track and perceptual scores for the realism track.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
-
The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge Report
The NTIRE 2026 report benchmarks 15 valid submissions that maintain ~26.9 dB PSNR on DIV2K_LSDIR while reducing runtime, parameters and FLOPs.
-
Mixture-of-Experts in Remote Sensing: A Survey
A survey providing the first systematic overview of Mixture-of-Experts models applied to remote sensing tasks, covering principles, architectures, applications, and future trends.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.