CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Mixed citation behavior. Most common role is background (62%).
abstract
In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). Especially, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units. Also, we found GRU to be comparable to LSTM.
representative citing papers
CPF-GCD enforces low-rank compositional structure on vision backbone features via spatial primitive fields so that novel categories emerge as new activation patterns over a shared vocabulary of reusable visual primitives.
Presents UKHD, the first historical offline Urdu handwritten text lines dataset from Katib materials, and benchmarks CRNN-based models, with CNN-BGRU-CTC showing the lowest CER and WER.
AdaState replaces the static first-frame KV anchor with an evolving hidden latent that the model denoises alongside content, treating time as relative to enable recurrence and richer dynamics in streaming video generation.
LC-Flow introduces a continuous local recurrent network for learning sparse optical flow and confidence directly from event streams, with confidence-guided aggregation reaching new SOTA on MVSEC.
Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-global logarithm resummation in the large-Nc limit.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
The What-Where Transformer achieves explicit what-where separation in a ViT-style backbone via concurrent token and attention-map streams, yielding emergent object discovery from attention maps and better weakly-supervised localization.
TCRTransBench provides a new benchmark with bidirectional TCR-peptide generation tasks, a large validated dataset, and metrics to evaluate neural models for immunological sequence modeling.
NeuroKalman mitigates state drift in vision-language UAV navigation by using memory-augmented Kalman filtering where attention retrieves historical anchors to correct predictions without gradient updates.
ExDoS uses expert-guided dual-focus distillation between source semantic graphs and bytecode control-flow graphs plus a dual-attention network to improve smart contract vulnerability detection, reporting 3-6% F1 gains over baselines.
citing papers explorer
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning
LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.
-
CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations
CoMetaPNS combines meta-learned neural surrogates with a continual Bayesian Gaussian Mixture Model to adapt cardiac electrophysiology simulations to new data while avoiding catastrophic forgetting.
-
Identify Then Project: Contrastive Learning of Latent Dynamics from Partial Observations with Port-Hamiltonian Structure
A two-stage contrastive teacher-student framework learns and then projects latent dynamics onto port-Hamiltonian submanifolds from partial observations.
-
Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction
PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.
-
TailedTS: Benchmark Dataset for Heavy-Tailed Time Series Prediction and Periodicity Quantification
TailedTS supplies 24.69 billion Wikipedia page-view records as a public benchmark for heavy-tailed time series forecasting and periodicity analysis, revealing weaker periodic structure in high-traffic pages.
-
Learning to Theorize the World from Observation
NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.
-
Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning
CLOVER augments value decomposition with a GNN mixer whose weights depend on the realized wireless communication graph, proving permutation invariance, monotonicity, and greater expressiveness than QMIX while showing gains on Predator-Prey and Lumberjacks under p-CSMA channels.
-
Oscillators Are All You Need: Irregular Time Series Modelling via Damped Harmonic Oscillators with Closed-Form Solutions
Damped harmonic oscillators with closed-form solutions model keys, values, and queries in continuous attention for irregular time series, preserving universal approximation while being orders of magnitude faster than prior NODE-based methods.
-
Unsupervised Learning of Local Updates for Maximum Independent Set in Dynamic Graphs
An unsupervised GNN model learns local updates for approximate MaxIS on dynamic graphs, achieving competitive ratios on 200-1000 node instances and 1.00-1.18x larger solutions than other unsupervised models when generalizing to 100x larger graphs.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
The Griffin hybrid model matches Llama-2 performance while being trained on over 6 times fewer tokens, and offers lower inference latency with higher throughput.
-
Estimation–Prediction Tradeoff in Causal Probabilistic Temporal Graphs
Characterizes an estimation-prediction tradeoff in binary logistic models for causal probabilistic temporal graphs and proposes a framework to jointly evaluate temporal link prediction with causal parameter recovery via Cramér-Rao bounds.
-
Kolmogorov-Arnold Reservoir Computing
KARC is a lightweight KAN-style reservoir that admits closed-form training and outperforms standard reservoir computing on PDE benchmarks at comparable cost.
-
Pretraining Recurrent Networks without Recurrence
SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.
-
Generating Financial Time Series by Matching Random Convolutional Features
Introduces SOCK (SOft Competing Kernels), a differentiable random convolutional feature map, to train generative models of financial time series via feature matching and shows outperformance over signature and diffusion baselines on small-sample datasets.
-
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
-
Learning to Test: Physics-Informed Representation for Dynamical Instability Detection
A physics-informed neural representation is learned from safe data to support distributional hypothesis testing for dynamical instability in stochastic DAE systems without repeated simulations.
-
M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
-
Short window attention enables long-term memorization
Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.
-
Flow marching for a generative PDE foundation model
Flow Marching jointly samples noise and physical time to learn a velocity field for generative PDE modeling, paired with a latent autoencoder and efficient transformer for large-scale pretraining on 2.5M trajectories.
-
Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting
Logo-LLM improves time series forecasting by pulling local dynamics from shallow LLM layers and global trends from deeper layers, then aligning them via new Local-Mixer and Global-Mixer modules.
-
R-Transformer: Recurrent Neural Network Enhanced Transformer
R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse tasks.
-
Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter
CA-NKCF is a hybrid neural-Kalman consensus filter for distributed state estimation that operates without noise covariance knowledge and shows robustness to model misspecification in linear, chaotic, and wireless scenarios.
-
Interpretable Kolmogorov-Arnold Network with Feature-Isolated Temporal Attention Mechanism for Electricity Load Forecasting
LoadKAN combines feature-isolated temporal attention with KAN to produce competitive load forecasts on three U.S. markets and enables quantitative analysis of non-linear mobility-load relationships via learned activation functions.
-
Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring
GGATN combines graph grounding with transformer self- and cross-attention to generate full event sequences, timestamps, length, and attributes in a single pass followed by Viterbi-style constrained decoding, outperforming prompted LLM baselines on six logs with zero hallucinated activities.
-
Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations
A GRU encoder using static acoustic, dynamic, and interaction features from 53 dyads predicts cognitive load related to time pressure, mental work, effort, and performance, with turn-taking linked to temporal demand and imbalanced participation to mental demand.
-
Boosting ECG Classification Performance by Pre-training with Synthesized Data
Pre-training ten DNN architectures on knowledge-driven synthetic ECGs generated via Gaussian PQRST wave composition improves classification of AF, AFLT, PVC, and WPW, with the largest gain of 33.2% for AFLT and stronger benefits on smaller real datasets.
-
EvoCSFL: Surrogate-Assisted Evolutionary Client Selection for Efficient and Robust Federated Learning
EvoCSFL combines candidate generation, a multi-objective metric, surrogate approximation, and evolutionary search to optimize client subsets in federated learning, reporting faster convergence and lower energy on image classification tasks.
-
HoT-SSM: Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care
HoT-SSM combines hypergraph construction from domain knowledge with a dynamic state space model to jointly capture higher-order clinical interactions and long-range temporal dependencies, yielding improved predictions on MIMIC-III and MIMIC-IV.
-
RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities
RePercENT introduces a plug-and-play self-supervised framework for scalable pairwise disentangled representation learning across more than two modalities using pre-extracted embeddings and a joint optimization objective with theoretical optimality guarantees.
-
Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
GTR introduces a bounded non-monotonic Gaussian trust region and Mixture Gaussian Anchor to enable effective behavior transitions in non-stationary RL where standard PPO fails.
-
CHAM-net: A Contrastive Hierarchical Adaptive Meta-network for Robust Global Methane Flux Prediction
CHAM-net is a contrastive hierarchical adaptive meta-network that conditions predictions on historical site data to outperform baselines on methane flux tasks with nRMSE down to 0.43.
-
Atoms of Thought: Universal EEG Representation Learning with Microstates
A microstate tokenizer built from clustered EEG signals provides universal representations that outperform traditional time- and frequency-domain features across sleep staging, emotion recognition, and motor imagery tasks.
-
Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging
Randomly initialized Transformers act as adaptive sequence smoothers for sleep staging via a Random Attention Prior Kernel, with gains mainly from inductive bias rather than training.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
Recurrent RL policies can have their hidden states aligned with PMP co-states through a derived loss, yielding robust performance on partially observable control tasks.
-
Probabilistic Hysteresis Factor Prediction for Electric Vehicle Batteries with Graphite Anodes Containing Silicon
A data-driven probabilistic approach predicts the hysteresis factor for silicon-graphite anode batteries in electric vehicles, with tests for generalization across vehicle models.
-
WaveletInception Networks for on-board Vibration-Based Infrastructure Health Monitoring
The WaveletInception-BiGRU network uses learnable wavelet packet transforms, 1D Inception-ResNet modules, and BiGRU layers to generate high-resolution, spatially mapped health profiles from variable-speed vibration data, outperforming prior methods on track stiffness and transition zone tasks.
-
Latent Multi-Criteria Ratings for Recommendations
Uses variational autoencoders on user reviews to generate latent multi-criteria ratings that outperform baselines on multiple datasets.
-
Improving Patient Subtyping on Longitudinal Data using Representations from Mamba-based Architecture
A self-supervised Mamba model learns EHR representations that improve patient subtyping on longitudinal data compared to baselines.
-
Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information
Ambient sound and light data from ICU rooms predict delirium with AUC 0.80 using convolutional neural networks, with sound as the dominant predictor, on data from 309 patients across 9 ICUs.
-
Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation
Frozen Chronos-2 TSFM embeddings plus a lightweight regression head outperform standard baselines for RUL estimation on two industrial sensor datasets.
-
Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection
CoAD unifies outlier exposure classification and masked autoencoder reconstruction in a cooperative loop to detect subtle and prolonged time series anomalies.
-
Physics-based Digital Twins for Integrated Thermal Energy Systems Using Active Learning
Active learning with physics-informed surrogates achieves comparable accuracy for a glycol heat exchanger digital twin using only one-fifth the high-fidelity simulation trajectories needed by random sampling.
-
Deep Learning for Time Series Forecasting: The Electric Load Case
Compares feedforward, recurrent, sequence-to-sequence and temporal convolutional neural networks for short-term electric load forecasting through experiments on two real datasets.
-
Rare Disease Detection by Sequence Modeling with Generative Adversarial Networks
A GAN-boosted RNN model reaches 0.56 PR-AUC for rare EPI detection on 1.8 million patients and outperforms benchmarks.
-
Multiplicative Models for Recurrent Language Modeling
New multiplicative RNN models are tested on char-level LM tasks to demonstrate the relevance of shared parametrization for the intermediate state.
-
JourneyFormer: Encoding Airbnb Guest Journey with Sequence Modeling
JourneyFormer applies sequence modeling to Airbnb guest event sequences for production search ranking, with design choices for long sparse data and reported business metric improvements via A/B testing.
-
Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers
An empirical comparison of LSTM, GNN, and Transformer architectures for NBA trajectory forecasting finds that a hybrid LSTM with contextual information yields the lowest FDE of 1.51 m over horizons up to 2 s.