CogScale: Scalable Benchmark for Sequence Processing

Romain de Coudenhove (ENS-PSL); Xavier Hinaut (Mnemosyne); Yannis Bendi-Ouis (Mnemosyne)

arxiv: 2605.19758 · v1 · pith:P5NTWM5Tnew · submitted 2026-05-19 · 💻 cs.AI · cs.DB · stat.ML

Pith reviewed 2026-05-20 05:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.DB · stat.ML

keywords CogScale · sequence processing · synthetic benchmark · attention mechanisms · state-space models · recurrent networks · parameter budgets · memory tasks

The pith

CogScale benchmark shows attention and state-space models alone sustain performance as sequence reasoning complexity increases under fixed parameter limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CogScale, a collection of 14 synthetic tasks that can be adjusted in scale and difficulty to probe memory and cognitive skills in sequence models. It evaluates seven architectures including GRUs, LSTMs, Echo State Networks, Transformers, and Mamba under strict budgets of 1k, 10k, and 100k parameters. Results indicate that RNNs and echo state networks handle basic retention efficiently at small sizes, yet only attention mechanisms and modern state-space models keep accuracy high when tasks demand more complex manipulation over longer sequences. This approach matters because it gives researchers an inexpensive way to compare designs before committing to full-scale training runs.

Core claim

CogScale is built from 14 parametrizable synthetic tasks that isolate distinct memory and reasoning abilities. Testing GRU, LSTM, xLSTM, Echo State Network, Mamba, Transformer Decoder, and Transformer Encoder-Decoder under identical parameter budgets across increasing difficulty levels shows that classical RNNs and Echo State Networks perform strongly on simple retention, while attention-based models and Mamba are the only ones that maintain high performance as reasoning complexity and task difficulty grow.
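To illustrate what comparing architectures under identical parameter budgets can involve, here is a minimal sketch that finds the largest hidden size a single-layer GRU or LSTM could use within a budget. It is not the authors' matching procedure. The function names and the input and output sizes are hypothetical, and the counts use the textbook formulas with one bias vector per gate; some libraries (PyTorch, for example) store two bias vectors per gate, which gives slightly larger counts.

```python
def gru_params(input_size, hidden_size):
    # 3 gate blocks, each with input weights, recurrent weights, and one bias vector
    return 3 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)


def lstm_params(input_size, hidden_size):
    # 4 gate blocks (input, forget, cell, output), same structure as above
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)


def largest_hidden_size(count_fn, input_size, output_size, budget):
    """Largest hidden size whose recurrent layer plus a linear readout fits the budget.

    Returns 0 if even a hidden size of 1 exceeds the budget.
    """
    best = 0
    hidden = 1
    while True:
        readout = hidden * output_size + output_size  # weights plus bias
        if count_fn(input_size, hidden) + readout > budget:
            return best
        best = hidden
        hidden += 1


# Hypothetical setting: 8 input features, 8 output classes, budgets from the paper
for budget in (1_000, 10_000, 100_000):
    print(budget,
          largest_hidden_size(gru_params, 8, 8, budget),
          largest_hidden_size(lstm_params, 8, 8, budget))
```

Because an LSTM has four gate blocks and a GRU three, the same budget buys the GRU a wider hidden state. This is one reason the exact matching method matters for interpreting the comparison.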

What carries the argument

The CogScale benchmark of 14 scalable synthetic tasks that isolate specific cognitive and memory abilities for controlled comparisons across architectures at different parameter scales.

Load-bearing premise

That the 14 synthetic tasks isolate cognitive and memory abilities in a manner representative of real sequence-processing challenges, and that fixed parameter budgets produce fair comparisons across architectures.

What would settle it

If the relative performance ordering of the seven architectures reverses when the same models are run on large real-world sequence tasks such as long-context question answering, the benchmark's ability to predict scalable behavior would be falsified.
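One way to make this test concrete is a rank correlation between per-architecture scores on the benchmark and on a real-world task. The sketch below computes Kendall's tau-a in plain Python. The function name and the score dictionaries are hypothetical, the paper does not prescribe this statistic, and the placeholder values are not results of any kind.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by the same model names."""
    names = sorted(scores_a)
    if sorted(scores_b) != names:
        raise ValueError("both score dicts must cover the same models")
    if len(names) < 2:
        raise ValueError("need at least two models to compare orderings")
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        # Positive product: both settings order the pair the same way
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder scores for three anonymous models (illustrative only)
benchmark = {"A": 0.9, "B": 0.6, "C": 0.3}
real_world = {"A": 0.2, "B": 0.5, "C": 0.8}
print(kendall_tau(benchmark, real_world))  # fully reversed ordering
```

A value near 1 would mean the ordering carries over to the real-world task; a value near -1 would correspond to the reversal described above.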

Original abstract

The ability to maintain and manipulate information over time is a fundamental aspect of living beings and Artificial Intelligence. While modern models have achieved remarkable success in tasks like natural language processing, evaluating the capacity of novel architectures to process sequential information remains computationally expensive and time-consuming. Testing a new architecture often requires scaling up to massive datasets and models, leading to vast computational costs and slow iteration cycles. In this paper, we propose CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales. By providing a standardized, lightweight framework, CogScale allows researchers to rapidly validate architectural innovations before committing to large-scale training. To establish a solid baseline, we evaluate seven distinct architectures: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder. These evaluations are conducted under strict parameter budgets (1k, 10k, and 100k) and across different difficulty levels and scales. Our results show that while classical RNNs and Echo State Networks excel at basic retention within strict parameter budgets, only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CogScale introduces a lightweight benchmark for sequence models, but its scaling claims need more detail on tasks and parameter controls to be convincing.

The main takeaway is that this paper introduces CogScale as a benchmark of 14 scalable synthetic tasks for evaluating sequence processing abilities in models, with comparisons across seven architectures under fixed parameter budgets of 1k, 10k, and 100k. The abstract highlights that RNNs and ESNs do well on basic tasks but struggle as difficulty scales, unlike attention and state-space models.

What stands out positively is the intent to create a lightweight, standardized way to test new architectures quickly without massive datasets or compute. By focusing on synthetic tasks that can be parametrized for different scales and difficulties, it aims to speed up iteration in areas like memory and temporal reasoning. Running the same set of models at controlled parameter counts provides a fairer baseline than many ad-hoc comparisons.

That said, the soft spots are noticeable given what's available. The abstract does not include the specific task definitions, how difficulty is increased through parameters like sequence length or dependency depth, or the exact methods for matching parameter budgets across different architectures. Without these, it's difficult to confirm that the performance differences reflect true architectural advantages rather than variations in effective capacity or task-specific artifacts. The stress-test note points to this exact issue, and it aligns with the limited information here, so the central scaling claim remains provisional.

This paper seems geared toward researchers developing or comparing sequence models who need quicker ways to validate ideas. Someone looking for benchmarks in cognitive or memory-related AI tasks might find the concept helpful once more details are filled in. I would recommend sending it for peer review. The benchmark idea addresses a real need for efficient evaluation, and referees could help strengthen the methods and analysis sections to make the results more convincing.

Referee Report

1 major / 0 minor

Summary. The paper introduces CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate specific cognitive and memory abilities for sequence processing at controllable scales. It evaluates seven architectures (GRU, LSTM, xLSTM, ESN, Mamba, Transformer Decoder, Transformer Encoder-Decoder) under strict parameter budgets of 1k, 10k, and 100k parameters across varying difficulty levels. The central claim is that classical RNNs and Echo State Networks perform well on basic retention within these budgets, but only attention mechanisms and modern state-space models maintain high performance as reasoning complexity and task difficulty increase.

Significance. If the results hold, CogScale would offer a lightweight, standardized framework for rapid validation of sequence-processing architectures, reducing the need for expensive large-scale training and enabling faster iteration on innovations. The explicit focus on isolating abilities and enforcing parameter budgets is a constructive contribution toward fair architectural comparisons.

major comments (1)

[Abstract] Abstract: The manuscript states comparative results showing that 'only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale' yet supplies no task definitions, no parametrization of difficulty (e.g., sequence length, dependency depth, or noise), no metrics, no statistical details, and no description of how each architecture is configured to meet the exact 1k/10k/100k parameter counts. These omissions are load-bearing for the scaling claim.
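To make concrete what a parametrization of difficulty could look like, here is a minimal sketch of a delayed-recall task whose difficulty is set by sequence length, dependency depth (the delay), and input noise. It is not one of the paper's 14 tasks. The function name, arguments, and defaults are all hypothetical.

```python
import random


def make_delayed_recall(seq_len, delay, vocab_size=8, noise=0.0, seed=0):
    """Build one (inputs, targets) pair for a hypothetical delayed-recall task.

    The target at step t is the clean symbol from step t - delay, or None
    for the first `delay` steps, where nothing is available to recall yet.
    """
    if not 0 <= delay < seq_len:
        raise ValueError("delay must satisfy 0 <= delay < seq_len")
    rng = random.Random(seed)  # local generator keeps examples reproducible
    clean = [rng.randrange(vocab_size) for _ in range(seq_len)]
    # With probability `noise`, an input symbol is replaced by a random one;
    # targets are always taken from the clean sequence
    inputs = [rng.randrange(vocab_size) if rng.random() < noise else s
              for s in clean]
    targets = [clean[t - delay] if t >= delay else None
               for t in range(seq_len)]
    return inputs, targets


# Longer sequences, larger delays, and more noise each make the task harder
inputs, targets = make_delayed_recall(seq_len=12, delay=4, noise=0.1, seed=1)
print(inputs)
print(targets)
```

Reporting the values of knobs like these for each difficulty level would be one way to supply the details this comment asks for.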

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We address the single major comment below and commit to revisions that strengthen the clarity and substantiation of our claims.

Referee: [Abstract] Abstract: The manuscript states comparative results showing that 'only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale' yet supplies no task definitions, no parametrization of difficulty (e.g., sequence length, dependency depth, or noise), no metrics, no statistical details, and no description of how each architecture is configured to meet the exact 1k/10k/100k parameter counts. These omissions are load-bearing for the scaling claim.

Authors: We agree that the abstract, being a concise summary, omits the detailed task definitions, parametrization of difficulty, metrics, statistical details, and exact configuration methods for the parameter budgets. These elements are fully specified in the Methods, Benchmark Tasks, and Experimental Setup sections of the manuscript. To directly address the concern that the omissions undermine the scaling claim, we will revise the abstract to include brief descriptions of the 14 tasks, how difficulty is scaled (via controllable parameters such as sequence length, dependency depth, and added noise), the primary metrics (accuracy and retention scores), and a note that all architectures are configured using standard hyperparameter adjustments to match the exact 1k/10k/100k parameter counts (with precise implementation details provided in the full text). We will also add explicit cross-references to the relevant sections. This revision will make the central claim more self-contained while preserving the abstract's brevity.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluations independent of inputs

The provided abstract and full text contain no equations, derivations, first-principles predictions, or parameter-fitting steps that could reduce to self-definition or fitted inputs by construction. The paper defines a benchmark of 14 synthetic tasks and reports direct empirical results from evaluating seven architectures under fixed parameter budgets (1k/10k/100k). Performance claims follow from running the models on the tasks at varying difficulty levels, with no load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results presented as novel derivations. The benchmark definition and evaluation protocol are independent of the reported outcomes, making the derivation chain self-contained with no circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the contribution is an empirical benchmark and set of baseline evaluations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales... under strict parameter budgets (1k, 10k, and 100k)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.