CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
hub Mixed citations
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Mixed citation behavior. Most common role is background (62%).
abstract
In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). Especially, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units. Also, we found GRU to be comparable to LSTM.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). Especially, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units. Also, we found GRU to be comparable to LSTM.
co-cited works
representative citing papers
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-global logarithm resummation in the large-Nc limit.
A two-stage contrastive teacher-student framework learns and then projects latent dynamics onto port-Hamiltonian submanifolds from partial observations.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.
The What-Where Transformer achieves explicit what-where separation in a ViT-style backbone via concurrent token and attention-map streams, yielding emergent object discovery from attention maps and better weakly-supervised localization.
TailedTS supplies 24.69 billion Wikipedia page-view records as a public benchmark for heavy-tailed time series forecasting and periodicity analysis, revealing weaker periodic structure in high-traffic pages.
TCRTransBench provides a new benchmark with bidirectional TCR-peptide generation tasks, a large validated dataset, and metrics to evaluate neural models for immunological sequence modeling.
CLOVER augments value decomposition with a GNN mixer whose weights depend on the realized wireless communication graph, proving permutation invariance, monotonicity, and greater expressiveness than QMIX while showing gains on Predator-Prey and Lumberjacks under p-CSMA channels.
Damped harmonic oscillators with closed-form solutions model keys, values, and queries in continuous attention for irregular time series, preserving universal approximation while being orders of magnitude faster than prior NODE-based methods.
NeuroKalman mitigates state drift in vision-language UAV navigation by using memory-augmented Kalman filtering where attention retrieves historical anchors to correct predictions without gradient updates.
ExDoS uses expert-guided dual-focus distillation between source semantic graphs and bytecode control-flow graphs plus a dual-attention network to improve smart contract vulnerability detection, reporting 3-6% F1 gains over baselines.
Unsupervised GNN model learns local updates for approximate MaxIS on dynamic graphs, achieving competitive ratios on 200-1000 node instances and 1.00-1.18x larger solutions than other unsupervised models when generalizing to 100x larger graphs.
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
ComHymba introduces a domain-informed wireless foundation model with Hymba blocks for linear-complexity CSI modeling, reporting accuracy gains on eight downstream tasks and up to 3.3x inference speedup over Transformers.
TopoPrior learns transferable topology priors offline from multi-domain reference graphs using a conditional variational graph model and adversarial adaptation to initialize collaboration structures for multi-agent LLM systems, reducing online search overhead.
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.
UniDetect is an LLM-based system that generates universal transaction summary texts and uses two-stage multimodal training on text plus graphs to detect fraudulent accounts across heterogeneous blockchains, outperforming baselines by 5.57-7.58% KS and achieving over 94.58% zero-shot cross-chain and
A physics-informed neural representation is learned from safe data to support distributional hypothesis testing for dynamical instability in stochastic DAE systems without repeated simulations.
RF-LEGO turns signal processing algorithms into trainable modular DL modules via deep unrolling, outperforming pure SP and DL baselines in RF sensing while preserving interpretability.
BAIM enriches knowledge tracing item representations by deriving stage-level embeddings from Polya's four problem-solving stages and routing them adaptively per learner context, yielding consistent gains over pretraining baselines on two datasets.
citing papers explorer
-
CanViT: Toward Active-Vision Foundation Models
CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms
Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-global logarithm resummation in the large-Nc limit.
-
Identify Then Project: Contrastive Learning of Latent Dynamics from Partial Observations with Port-Hamiltonian Structure
A two-stage contrastive teacher-student framework learns and then projects latent dynamics onto port-Hamiltonian submanifolds from partial observations.
-
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
-
Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction
PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.
-
What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
The What-Where Transformer achieves explicit what-where separation in a ViT-style backbone via concurrent token and attention-map streams, yielding emergent object discovery from attention maps and better weakly-supervised localization.
-
TailedTS: Benchmark Dataset for Heavy-Tailed Time Series Prediction and Periodicity Quantification
TailedTS supplies 24.69 billion Wikipedia page-view records as a public benchmark for heavy-tailed time series forecasting and periodicity analysis, revealing weaker periodic structure in high-traffic pages.
-
TCRTransBench: A Comprehensive Benchmark for Bidirectional TCR-Peptide Sequence Generation
TCRTransBench provides a new benchmark with bidirectional TCR-peptide generation tasks, a large validated dataset, and metrics to evaluate neural models for immunological sequence modeling.
-
Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning
CLOVER augments value decomposition with a GNN mixer whose weights depend on the realized wireless communication graph, proving permutation invariance, monotonicity, and greater expressiveness than QMIX while showing gains on Predator-Prey and Lumberjacks under p-CSMA channels.
-
Oscillators Are All You Need: Irregular Time Series Modelling via Damped Harmonic Oscillators with Closed-Form Solutions
Damped harmonic oscillators with closed-form solutions model keys, values, and queries in continuous attention for irregular time series, preserving universal approximation while being orders of magnitude faster than prior NODE-based methods.
-
Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering
NeuroKalman mitigates state drift in vision-language UAV navigation by using memory-augmented Kalman filtering where attention retrieves historical anchors to correct predictions without gradient updates.
-
ExDoS: Expert-Guided Dual-Focus Cross-Modal Distillation for Smart Contract Vulnerability Detection
ExDoS uses expert-guided dual-focus distillation between source semantic graphs and bytecode control-flow graphs plus a dual-attention network to improve smart contract vulnerability detection, reporting 3-6% F1 gains over baselines.
-
Unsupervised Learning of Local Updates for Maximum Independent Set in Dynamic Graphs
Unsupervised GNN model learns local updates for approximate MaxIS on dynamic graphs, achieving competitive ratios on 200-1000 node instances and 1.00-1.18x larger solutions than other unsupervised models when generalizing to 100x larger graphs.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
-
ComHymba: Low-Complexity Domain-Informed Foundation Model for Wireless Communications
ComHymba introduces a domain-informed wireless foundation model with Hymba blocks for linear-complexity CSI modeling, reporting accuracy gains on eight downstream tasks and up to 3.3x inference speedup over Transformers.
-
Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains
TopoPrior learns transferable topology priors offline from multi-domain reference graphs using a conditional variational graph model and adversarial adaptation to initialize collaboration structures for multi-agent LLM systems, reducing online search overhead.
-
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
-
Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models
ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.
-
UniDetect: LLM-Driven Universal Fraud Detection across Heterogeneous Blockchains
UniDetect is an LLM-based system that generates universal transaction summary texts and uses two-stage multimodal training on text plus graphs to detect fraudulent accounts across heterogeneous blockchains, outperforming baselines by 5.57-7.58% KS and achieving over 94.58% zero-shot cross-chain and
-
Learning to Test: Physics-Informed Representation for Dynamical Instability Detection
A physics-informed neural representation is learned from safe data to support distributional hypothesis testing for dynamical instability in stochastic DAE systems without repeated simulations.
-
RF-LEGO: Modularized Signal Processing-Deep Learning Co-Design for RF Sensing via Deep Unrolling
RF-LEGO turns signal processing algorithms into trainable modular DL modules via deep unrolling, outperforming pure SP and DL baselines in RF sensing while preserving interpretability.
-
Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing
BAIM enriches knowledge tracing item representations by deriving stage-level embeddings from Polya's four problem-solving stages and routing them adaptively per learner context, yielding consistent gains over pretraining baselines on two datasets.
-
CWRNN-INVR: A Coupled WarpRNN based Implicit Neural Video Representation
CWRNN-INVR combines WarpRNN for structured video information and residual grids for irregular details to reach 33.73 dB average PSNR on the UVG dataset at 3M parameters, outperforming existing INVR methods.
-
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
-
Uniform Inductive Spatio-Temporal Kriging
UniSTOK improves inductive spatio-temporal kriging under incomplete observations by reliability-guided signal regulation and residual bias calibration.
-
SCASRec: A Self-Correcting and Auto-Stopping Model for Generative Route List Recommendation
SCASRec unifies ranking and redundancy elimination for route lists via stepwise corrective rewards and an adaptive end-of-recommendation token, claiming SOTA results on two datasets and real deployment.
-
Short window attention enables long-term memorization
Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.
-
Flow marching for a generative PDE foundation model
Flow Marching jointly samples noise and physical time to learn a velocity field for generative PDE modeling, paired with a latent autoencoder and efficient transformer for large-scale pretraining on 2.5M trajectories.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
-
Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting
Logo-LLM improves time series forecasting by pulling local dynamics from shallow LLM layers and global trends from deeper layers, then aligning them via new Local-Mixer and Global-Mixer modules.
-
DeePen: Penetration Testing for Audio Deepfake Detection
DeePen demonstrates that both production and academic audio deepfake detectors can be reliably deceived by simple signal processing attacks such as time-stretching or echo addition, with some attacks resistible via retraining and others remaining effective.
-
Non-invasive electromyographic speech neuroprosthesis: a geometric perspective
Direct sequence-to-sequence EMG-to-text conversion for silent articulation using a geometric representation of high-dimensional signals, without audio targets or time-alignment.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
R-Transformer: Recurrent Neural Network Enhanced Transformer
R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse tasks.
-
A Bi-directional Transformer for Musical Chord Recognition
A bi-directional Transformer achieves competitive chord recognition by using self-attention to capture long-term dependencies in audio in a single training phase.
-
Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments
The MIA model with GC, RGA, and BFM modules achieves state-of-the-art performance on the CUHK-PEDES dataset for description-based person re-identification.
-
Atoms of Thought: Universal EEG Representation Learning with Microstates
Microstate tokenizer from clustered EEG signals provides universal representations that outperform traditional time- and frequency-domain features across sleep staging, emotion recognition, and motor imagery tasks.
-
Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning
An RL-based outer-loop quadrotor controller augmented with an online Residual Dynamics Predictor for disturbance estimation and a data-efficient sim-to-real calibration bridge.
-
Spatiotemporal decoupled physics-informed Stone-Weierstrass neural operator for long-time prediction of time-dependent parametric PDEs
A spatiotemporally decoupled physics-informed Stone-Weierstrass neural operator for stable long-time prediction of time-dependent parametric PDEs.
-
Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging
Randomly initialized Transformers act as adaptive sequence smoothers for sleep staging via a Random Attention Prior Kernel, with gains mainly from inductive bias rather than training.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
Recurrent RL policies can have their hidden states aligned with PMP co-states through a derived loss, yielding robust performance on partially observable control tasks.
-
ReMedi: Reasoner for Medical Clinical Prediction
ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
-
HOI-aware Adaptive Network for Weakly-supervised Action Segmentation
AdaAct employs a HOI encoder and two-branch hypernetwork to adaptively adjust temporal encoding parameters based on video-level human-object interactions for improved weakly-supervised action segmentation.
-
STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation
STK-Adapter adds Spatial-Temporal MoE, Event-Aware MoE, and Cross-Modality Alignment MoE to integrate evolving TKG graphs and event chains into LLMs, reducing information loss and improving extrapolation performance over prior methods.
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction
K2K framework enables internal memory retrieval in LLMs for healthcare outcome prediction, achieving state-of-the-art results on four benchmarks.
-
Adaptive Learned State Estimation based on KalmanNet
AM-KNet adds sensor-specific modules, hypernetwork conditioning on target type and pose, and Joseph-form covariance estimation to KalmanNet, yielding better accuracy and stability than base KalmanNet on nuScenes and View-of-Delft data.