Recognition: 1 theorem link · Lean Theorem
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Pith reviewed 2026-05-11 19:30 UTC · model grok-4.3
The pith
A simple convolutional architecture outperforms LSTMs on diverse sequence tasks while showing longer effective memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors evaluate generic convolutional and recurrent architectures across a broad range of standard sequence modeling tasks used to benchmark recurrent networks. Their results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs, while demonstrating longer effective memory. They conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and that convolutional networks should be regarded as a natural starting point for sequence modeling tasks.
What carries the argument
A simple convolutional architecture that processes sequences in parallel and captures long-range dependencies without recurrence.
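The architecture in question (a temporal convolutional network, or TCN) rests on causal dilated convolutions: each output at time t depends only on inputs at or before t, so the whole sequence can be processed in parallel. A minimal NumPy sketch, assuming a single channel and zero left-padding; the function name and layout are illustrative, not taken from the paper or its code release:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    """Causal 1-D convolution: y[t] = sum_j w[j] * x[t - j*dilation],
    with x treated as zero before t = 0, so no future time step leaks in."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = [1.0, 2.0, 3.0, 4.0]
y1 = causal_dilated_conv(x, [1.0, 1.0], dilation=1)  # y[t] = x[t] + x[t-1]
y2 = causal_dilated_conv(x, [1.0, 1.0], dilation=2)  # y[t] = x[t] + x[t-2]
```

Stacking such layers with dilations 1, 2, 4, … grows the receptive field exponentially with depth while each layer stays parallel across time, which is what lets the model reach long-range dependencies without recurrence.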
Load-bearing premise
The chosen tasks and datasets represent general sequence modeling challenges and both architectures are implemented and compared fairly without hidden advantages.
What would settle it
A standard sequence modeling benchmark where an LSTM records higher accuracy or F1 score than the convolutional model, or where the convolutional model's effective memory span proves shorter than the LSTM's.
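The "effective memory span" in that test is bounded above by the TCN's receptive field. Assuming the common construction of residual blocks with two causal convolutions each and dilations doubling per block (an assumption, since this page does not state the architecture's details), the span follows from a geometric series:

```python
def tcn_receptive_field(kernel_size, num_blocks, convs_per_block=2):
    """Receptive field of a TCN whose block i uses dilation 2**i.

    Each of the convs_per_block causal convolutions in block i extends
    the field by (kernel_size - 1) * 2**i steps; summing over blocks
    gives 1 + convs_per_block * (kernel_size - 1) * (2**num_blocks - 1).
    """
    return 1 + convs_per_block * (kernel_size - 1) * (2 ** num_blocks - 1)

# e.g. kernel size 3 over 8 blocks already covers ~1000 time steps
span = tcn_receptive_field(3, 8)
```

An LSTM's memory has no such hard architectural cutoff, which is why the comparison turns on *effective* span measured empirically rather than on the nominal bound.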
read the original abstract
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a systematic empirical evaluation of generic convolutional and recurrent architectures for sequence modeling on standard benchmark tasks. It claims that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets while demonstrating longer effective memory, and concludes that convolutional networks should be regarded as a natural starting point for sequence modeling tasks. Code is made available at a GitHub repository.
Significance. If the reported results hold, the work would be significant in challenging the default association of sequence modeling with recurrent networks and in providing evidence for convolutional architectures as a competitive or superior alternative, supported by the public code release which enables reproducibility and independent verification.
major comments (1)
- [Abstract] The central claims of outperformance over LSTMs and longer effective memory are presented as direct results without any quantitative metrics, tables, task descriptions, implementation details, hyperparameter protocols, or statistical tests in the provided manuscript; these elements are load-bearing for assessing whether the data support the conclusion that convolutional networks should replace recurrent ones as the default.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the need for clarity on how the abstract relates to the supporting evidence in the manuscript. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] The central claims of outperformance over LSTMs and longer effective memory are presented as direct results without any quantitative metrics, tables, task descriptions, implementation details, hyperparameter protocols, or statistical tests in the provided manuscript; these elements are load-bearing for assessing whether the data support the conclusion that convolutional networks should replace recurrent ones as the default.
Authors: The abstract is a concise summary of the paper's primary conclusions, following standard academic conventions that reserve detailed evidence for the main text. The full manuscript supplies all requested elements: task and dataset descriptions appear in Section 4, the experimental protocol including implementation details and hyperparameter search is given in Section 5, quantitative results with direct LSTM comparisons are reported in Tables 1--3 and Figures 2--4, and the longer effective memory analysis is presented in Section 5.3 with supporting dilation and receptive-field experiments. Performance differences are shown consistently across multiple independent runs and diverse benchmarks, supporting the claim that the convolutional architecture is a competitive starting point. The public code release enables independent verification of all reported numbers.
revision: no
Circularity Check
No significant circularity
full rationale
The paper is a purely empirical study that performs a systematic comparison of convolutional and recurrent architectures on standard sequence modeling benchmarks. No derivations, first-principles results, fitted parameters presented as predictions, or self-referential equations appear in the abstract or described methodology. Claims rest on direct experimental measurements against external tasks and datasets, with code released for reproducibility. No load-bearing steps reduce to inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 44 Pith papers
-
Efficiently Modeling Long Sequences with Structured State Spaces
S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
-
U-STS-LLM: A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation
U-STS-LLM uses a spatio-temporally steered LLM with dynamic attention bias generation to achieve state-of-the-art results on long-horizon traffic forecasting and high-missing-rate imputation while remaining parameter-...
-
Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
Clin-JEPA supplies a multi-phase co-training method for JEPA pretraining on EHR trajectories that achieves converging latent rollouts and improved multi-task AUROC on MIMIC-IV data.
-
SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
-
AegisTS: A Hierarchical Agent System with Reinforcement Learning for Multivariate Time Series Data Cleaning
AegisTS uses a two-level RL agent architecture with a dual-stage reward to jointly optimize cleaning order and method selection for multivariate time series, delivering up to 96% better cleaning quality and 27% better...
-
BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton
BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.
-
GAFSV-Net: A Vision Framework for Online Signature Verification
GAFSV-Net encodes online signatures as asymmetric Gramian Angular Field images and processes them with dual-branch ConvNeXt plus cross-attention to outperform sequence-based baselines on DeepSignDB and BiosecurID.
-
Autocorrelation Reintroduces Spectral Bias in KANs for Time Series Forecasting
Temporal autocorrelation reintroduces spectral bias in KANs for time series forecasting, which DCT preprocessing can mitigate.
-
A Convolutional Neural Network-Derived Catalog of Solar Flares from Soft X-Ray Observations
The CNN-derived catalog detects over seven times more solar flares than the GOES catalog and extends the power-law distribution of flare peak fluxes to smaller sizes.
-
Adversarial Robustness of Deep State Space Models for Forecasting
Spacetime SSM forecasters represent optimal Kalman predictors for autoregressive data but remain vulnerable to model-free attacks that exploit local linearity and increase error by over 33% compared to projected gradi...
-
Self-Supervised Foundation Model for Calcium-imaging Population Dynamics
CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...
-
A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
PatchTST uses subseries patching and channel-independent Transformers to deliver significantly better long-term multivariate time series forecasting and strong self-supervised transfer performance.
-
Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.
-
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
-
Dynamics Aware Quadrupedal Locomotion via Intrinsic Dynamics Head
Concurrent training of an Intrinsic Dynamics Head with a dynamics reward yields more efficient and smoother quadrupedal locomotion policies that transfer to real robots with 12-18% gains in efficiency metrics.
-
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
GCL uses a two-stage protocol with Routing, Auditing, Public-Factor, and Aggregation Agents to mitigate modality dominance and spurious coupling in multimodal learning, achieving state-of-the-art results on CMU-MOSI, ...
-
WISE-FM: Operation-Aware, Engineering-Informed Foundation Model for Multi-Task Well Design
WISE-FM is a design-aware, physics-informed multi-task foundation model that reduces virtual flow metering error by up to 13x on simulated wells and transfers to real Equinor data with high R-squared values by conditi...
-
Data-Driven Open-Loop Simulation for Digital-Twin Operator Decision Support in Wastewater Treatment
CCSS-RS achieves RMSE 0.696 and CRPS 0.349 at 1000-step horizons on a large public WWTP benchmark with 43% missingness, outperforming Neural CDE baselines by 40-46% in RMSE.
-
Conditional Attribution for Root Cause Analysis in Time-Series Anomaly Detection
Conditional attribution retrieves contextually similar normal states from VAE latent spaces and UMAP embeddings to explain time-series anomalies while preserving dependencies, improving root-cause accuracy on SWaT and...
-
VQ-Wave: A physics-driven spatio-temporal deep learning approach for non-contrast-enhanced lung ventilation and perfusion MRI
VQ-Wave is a physics-driven spatio-temporal neural network that learns to extract ventilation and perfusion maps from non-contrast lung MRI by training on simulated signals with amplitude modulations, frequency drifts...
-
Modern Structure-Aware Simplicial Spatiotemporal Neural Network
ModernSASST is the first simplicial complex-based spatiotemporal model that combines random walks on high-dimensional complexes with parallelizable temporal convolutional networks for efficient high-order topology capture.
-
AIBuildAI: An AI Agent for Automatically Building AI Models
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
-
MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games
MISID is a multimodal multi-turn dataset for intent recognition in strategic deception games, paired with the FRACTAM framework that improves MLLM performance on hidden intent detection via decouple-anchor-reason steps.
-
BiTA: Bidirectional Gated Recurrent Unit-Transformer Aggregator in a Temporal Graph Network Framework for Alert Prediction in Computer Networks
BiTA redesigns temporal aggregation in TGNs by jointly using bidirectional GRU for sequential dependencies and Transformer for long-range context to improve alert prediction accuracy on real network data.
-
A General Framework for Generative Self-supervised Learning in Non-invasive Estimation of Physiological Parameters Using Photoplethysmography
TS2TC combines cross-temporal fusion generative anchor pretraining with dual-process transfer to achieve 2.49% lower RMSE than prior methods on PPG parameter estimation using only 10% labeled data.
-
ROMAN: A Multiscale Routing Operator for Convolutional Time Series Models
ROMAN converts time series into a shorter multiscale channel representation that lets standard CNN classifiers access scale and coarse-position information explicitly.
-
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
By applying attention and feed-forward networks to inverted variate tokens instead of temporal tokens, iTransformer achieves state-of-the-art performance on real-world time series forecasting datasets.
-
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.
-
Perceive, Route and Modulate: Dynamic Pattern Recalibration for Time Series Forecasting
Dynamic Pattern Recalibration (DPR) adds a perceive-route-modulate pipeline that generates time-aware modulation vectors to recalibrate hidden states in forecasting models, improving performance across architectures w...
-
Graph Neural Ordinary Differential Equations for Power System Identification
MPG-NODEs identify power system dynamics more flexibly than standard neural ODEs by using graph message passing, enabling transfer learning for adding or removing lines and units.
-
Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors
Eye movements during Holocaust survivor interviews vary by episodic, semantic, affective and temporal memory dimensions, with pre-onset gaze sufficient to predict sentence temporal context.
-
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...
-
MambaSL: Exploring Single-Layer Mamba for Time Series Classification
A single-layer Mamba variant with targeted redesigns sets new state-of-the-art average performance on all 30 UEA time series classification datasets under a unified reproducible protocol.
-
Degradation-aware Predictive Energy Management for Fuel Cell-Battery Ship Power System with Data-driven Load Forecasting
A degradation-aware predictive controller for hybrid ship power systems reduces hydrogen consumption by up to 5.8% and fuel cell degradation by up to 36.4% versus a filter-based benchmark on real harbor tug data.
-
Multimodal Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions
Multimodal deep learning for ambivalence/hesitancy recognition in videos yields limited results on the BAH dataset, highlighting the need for improved spatio-temporal and cross-modal fusion methods.
-
MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition
MSGL-Transformer reaches 75.4% accuracy on RatSI and 87.1% on CalMS21 for rodent behavior classification, beating TCN, LSTM, and several graph-based baselines.
-
Transformer-Based Wildlife Species Classification from Daily Movement Trajectories
Transformer models classify seven wildlife species from daily GPS trajectories, outperforming LSTM, CNN, and TCN baselines by 8-22 percentage points in balanced accuracy under region-holdout evaluation.
-
CNN-based Multi-In-Multi-Out Model for Efficient Spatiotemporal Prediction
MIMO-ESP is a new CNN-Transformer hybrid architecture for spatiotemporal prediction that claims to handle global information efficiently, keep time independent, and outperform prior models on video, traffic, and preci...
-
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
Group Cognition Learning uses governed two-stage agents after separate modality encoding to mitigate dominance and spurious coupling, reporting state-of-the-art results on CMU-MOSI, CMU-MOSEI, and MIntRec for regressi...
-
Electricity price forecasting across Norway's five bidding zones in the post-crisis era
LightGBM delivers the strongest forecasts for Norwegian electricity prices across zones, with simple lagged-price and calendar models often sufficient but regime analysis improving error stratification.
-
Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis
The Dual-Branch Rebalancing Framework (DBR) mitigates shared-private branch imbalance in multimodal sentiment analysis via Temporal-Structural Factorization, Anchor-Guided Private Routing, and Bidirectional Rebalancin...
-
Interpretable Physics-Informed Load Forecasting for U.S. Grid Resilience: SHAP-Guided Ensemble Validation in Hybrid Deep Learning Under Extreme Weather
A hybrid deep learning model with physics regularization and SHAP analysis achieves 1.18% MAPE on ERCOT load data and up to 40.5% better performance on extreme events than its individual branches.
-
Causal-Transformer with Adaptive Mutation-Locking for Early Prediction of Acute Kidney Injury
CT-Former integrates continuous-time modeling and causal attention in a transformer to deliver accurate, interpretable early AKI prediction on the MIMIC-IV cohort of 18,419 patients.
-
The CTLNet for Shanghai Composite Index Prediction
The CTLNet hybrid model outperforms the listed baselines on the Shanghai Composite Index prediction task.
discussion (0)