Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
Pith reviewed 2026-05-08 04:36 UTC · model grok-4.3
The pith
In the continuous limit, SCA can exactly retrieve any individual token from the prefix summary and reproduce any softmax attention output as a special case.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that the SCA readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case, establishing that SCA is at least as expressive as full self-attention in the continuous limit. The Nautile-370M model alternates two SCA layers with one transformer layer in its backbone and was trained with a reinforcement learning stage focused on reasoning and verification.
What carries the argument
SeqCond Attention (SCA), a linear-time spectral sequence operator whose readout mechanism, in the continuous limit, performs exact token retrieval from a prefix summary and replicates any softmax attention output.
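The abstract names the operator but never shows its equations. As intuition for how a spectral prefix summary can support exact readout, here is a minimal PyTorch sketch of one plausible shape; the SeqCondenser-style characteristic-function features, the conjugate-phase readout, and every name in it are our assumptions, not the paper's definitions.

```python
import math
import torch
import torch.nn as nn

class SpectralPrefixSummary(nn.Module):
    """Toy stand-in for an SCA-style layer (assumed form, not the paper's).

    Maintains a running complex summary S_k[t] = sum_{s<=t} e^{i w_k s} v_s
    via a causal cumulative sum (linear time in sequence length), then reads
    the current position back out with the conjugate phase e^{-i w_k t}.
    """

    def __init__(self, d_model: int, n_freqs: int):
        super().__init__()
        # Fixed, evenly spaced frequencies; a learned or sampled set is
        # equally plausible and would change nothing structural below.
        self.register_buffer("freqs", 2 * math.pi * torch.arange(n_freqs) / n_freqs)
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, T, d_model)
        B, T, D = x.shape
        t = torch.arange(T, dtype=torch.float32, device=x.device)
        phase = torch.exp(1j * self.freqs[None, :] * t[:, None])      # (T, K)
        v = self.value(x).to(torch.complex64)                         # (B, T, D)
        # Causal spectral prefix summary; materialized as (B, T, K, D) for
        # clarity, though a real implementation would stream O(K*D) state.
        S = torch.cumsum(phase[None, :, :, None] * v[:, :, None, :], dim=1)
        # Conjugate-phase readout of the current position, averaged over k.
        read = (phase.conj()[None, :, :, None] * S).mean(dim=2).real  # (B, T, D)
        return self.out(read)
```

With n_freqs at least the sequence length, this readout recovers the value written at each position exactly (it is a discrete Fourier inversion); with fewer frequencies, aliasing mixes other positions in. That gap is precisely the discrete-versus-continuous question the rest of this review presses on.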
If this is right
- The hybrid model retains long-context efficiency and state-tracking benefits of structured sequential models.
- It preserves the expressive token-to-token routing of attention within a smaller parameter budget.
- Training remains feasible on limited hardware, including a single Cloud TPU v4-64 pod slice and one NVIDIA DGX Spark.
- A specialized reinforcement learning stage can improve reasoning, verification, and response quality.
Where Pith is reading between the lines
- If the discrete implementation holds up, SCA layers could substitute for attention in other small models to cut inference cost without losing capability.
- The hybrid pattern may improve state tracking over long sequences compared with pure transformer stacks (a stacking sketch follows this list).
- Similar spectral operators could be tested in other efficiency-focused architectures for reasoning tasks.
- Extending the exact-retrieval proof beyond the continuous limit would strengthen practical guarantees.
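A stacking sketch of the 2:1 alternation the abstract describes, reusing the SpectralPrefixSummary toy from above, with torch's stock encoder layer standing in for whatever transformer block Nautile-370M actually uses; widths, depths, and the absence of causal masking are all illustrative simplifications, not details from the paper.

```python
import torch.nn as nn

D_MODEL, N_HEADS, N_FREQS, N_BLOCKS = 512, 8, 64, 8  # illustrative sizes

def make_backbone() -> nn.Sequential:
    """Repeats the [SCA, SCA, Transformer] unit of the abstract's 2:1 pattern."""
    layers = []
    for _ in range(N_BLOCKS):
        layers += [
            SpectralPrefixSummary(D_MODEL, n_freqs=N_FREQS),  # toy from above
            SpectralPrefixSummary(D_MODEL, n_freqs=N_FREQS),
            nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True),
        ]
    return nn.Sequential(*layers)  # every layer maps (B, T, D) -> (B, T, D)
```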
Load-bearing premise
The continuous limit in the proof corresponds to the discrete token sequences used in actual language models, and the practical SCA implementation preserves the exact retrieval property.
What would settle it
A calculation or experiment on a discrete token sequence showing that SCA fails to retrieve a chosen token from the prefix summary or cannot match a specific softmax attention output.
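On a surrogate operator, that settling experiment is a few lines. A minimal NumPy sketch, again assuming the Fourier-feature summary from the earlier sketch rather than the paper's unpublished SCA equations: with at least as many sampled frequencies as positions, retrieval of a chosen token is exact, and undersampling the spectrum makes the same readout demonstrably fail.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 64, 16                       # prefix length, embedding width
v = rng.standard_normal((T, D))     # stand-in token embeddings

def retrieve(p: int, K: int) -> np.ndarray:
    """Summarize the prefix with K sampled frequencies, then read out position p."""
    w = 2 * np.pi * np.arange(K) / K                  # sampled frequencies
    S = np.exp(1j * np.outer(w, np.arange(T))) @ v    # (K, D) spectral summary
    return (np.exp(-1j * w * p) @ S).real / K         # conjugate-phase readout

p = 17
exact = retrieve(p, K=T)        # K >= T: discrete Fourier inversion is exact
lossy = retrieve(p, K=T // 2)   # K < T: aliased positions contaminate readout
print(np.max(np.abs(exact - v[p])))   # ~1e-15, machine precision
print(np.max(np.abs(lossy - v[p])))   # O(1): retrieval fails
```

Here the undersampled readout returns v[17] + v[49] (the two positions congruent mod 32), an O(1) failure of exactly the kind the trained 370M model would need to be shown to avoid.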
Original abstract
We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a linear-time spectral sequence operator inspired by SeqCondenser, alternate with one transformer layer. This design aims to retain the long-context efficiency and state-tracking benefits of structured sequential models while preserving the expressive token-to-token routing of attention. The model was trained on a single Cloud TPU v4-64 pod slice provided through the Google TPU Research Cloud (TRC) program; the subsequent reinforcement learning stage was carried out on a single NVIDIA DGX Spark. We prove that the SCA readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case, establishing that SCA is at least as expressive as full self-attention in the continuous limit. We also describe the training data pipeline and outline a reinforcement learning stage specialized for reasoning, verification, and response quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Nautile-370M, a 371-million-parameter language model with a hybrid backbone in which two SeqCond Attention (SCA) layers alternate with one standard transformer layer. It asserts a proof that the SCA readout mechanism exactly retrieves any individual token from the prefix summary and reproduces any softmax-attention output as a special case, thereby establishing that SCA is at least as expressive as full self-attention in the continuous limit. The paper also describes the training data pipeline, a reinforcement-learning stage for reasoning and verification, and the hardware used (TPU v4-64 pod slice followed by NVIDIA DGX Spark).
Significance. If the continuous-limit expressivity result can be shown to transfer to the discrete finite-precision setting, the hybrid design would offer a concrete route to long-context efficiency while retaining token-to-token routing power in small models. The manuscript is credited for attempting a theoretical characterization of SCA expressiveness and for documenting the training infrastructure and RL stage. The significance remains provisional because the central theoretical claim is not yet connected to the actual discrete implementation.
Major comments (2)
- [Abstract and expressiveness proof section] Abstract and the section presenting the SCA expressiveness proof: the claim that SCA 'can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case' is established only in the continuous limit. The manuscript provides no derivation steps, stated assumptions, or error bounds showing how this equivalence survives discretization, finite embedding dimension, learned parameters, and floating-point arithmetic in the 370M model.
- [Architecture and training description] The section describing the Nautile-370M architecture and training: no quantification of the approximation error between the continuous SCA operator and its discrete implementation is given, nor is any empirical check reported that the trained SCA layers retain the exact retrieval property on real token sequences. This gap directly affects whether the theoretical equivalence supports the claimed advantages of the hybrid backbone over standard attention.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important gaps between the continuous-limit theoretical claim and the discrete implementation, which we will address through targeted revisions to the manuscript.
Point-by-point responses
- Referee: [Abstract and expressiveness proof section] Abstract and the section presenting the SCA expressiveness proof: the claim that SCA 'can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case' is established only in the continuous limit. The manuscript provides no derivation steps, stated assumptions, or error bounds showing how this equivalence survives discretization, finite embedding dimension, learned parameters, and floating-point arithmetic in the 370M model.
Authors: We agree that the equivalence is established strictly in the continuous limit, as stated in the manuscript. The current version states the result without full derivation steps or explicit assumptions. In the revision we will expand the relevant section to include the step-by-step derivation of the continuous-limit result, list the key assumptions (continuous sequence representation, infinite-resolution spectral operator), and add a discussion of the approximation under discretization, finite embedding dimension, and floating-point effects, noting that learned parameters and empirical long-context performance provide supporting evidence while acknowledging the remaining gap. revision: yes
- Referee: [Architecture and training description] The section describing the Nautile-370M architecture and training: no quantification of the approximation error between the continuous SCA operator and its discrete implementation is given, nor is any empirical check reported that the trained SCA layers retain the exact retrieval property on real token sequences. This gap directly affects whether the theoretical equivalence supports the claimed advantages of the hybrid backbone over standard attention.
Authors: We acknowledge that the manuscript does not quantify the approximation error or report an empirical check of the retrieval property on discrete token sequences. The revision will add a new subsection providing a theoretical bound on the discretization error derived from the spectral operator and include empirical verification results testing whether the trained SCA layers can retrieve specific prefix tokens on held-out sequences. These additions will more directly link the continuous-limit theory to the observed behavior of the hybrid 370M model. revision: yes
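Neither the abstract nor the rebuttal shows the promised derivation. As an illustration of the shape it could take, assuming a characteristic-function summary in the SeqCondenser vein (an assumption; the excerpt never specifies SCA's readout), exact retrieval and the discretization error both fall out of one identity:

```latex
% Hypothetical spectral prefix summary and sampled-frequency readout
% (assumed form; the paper's SCA equations are not given in the excerpt):
S_t(\omega) = \sum_{s \le t} e^{i\omega s}\, v_s,
\qquad
\hat v_t(p) = \frac{1}{K} \sum_{k=0}^{K-1} e^{-i\omega_k p}\, S_t(\omega_k),
\qquad \omega_k = \tfrac{2\pi k}{K}.

% Exchanging sums and applying the geometric-sum identity
% \frac{1}{K}\sum_k e^{i\omega_k (s-p)} = \mathbf{1}[s \equiv p \ (\mathrm{mod}\ K)]:
\hat v_t(p)
  = \sum_{s \le t} v_s \cdot \frac{1}{K}\sum_{k=0}^{K-1} e^{i\omega_k (s-p)}
  = \sum_{\substack{s \le t \\ s \equiv p \ (\mathrm{mod}\ K)}} v_s .
```

So on this surrogate, retrieval is exact whenever K exceeds the prefix length, it is exact for every prefix in the limit of infinitely many sampled frequencies, and for a fixed sampled spectrum the discretization error the referee asks to see bounded is exactly the sum of the aliased terms v_{p±K}, v_{p±2K}, and so on.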
Circularity Check
No circularity: expressiveness established by explicit proof rather than definition or self-citation
Full rationale
The paper's central derivation is the stated proof that the SCA readout exactly retrieves prefix tokens and reproduces softmax attention outputs as a special case, but only in the continuous limit. This is presented as a mathematical result derived from the operator definition, not as a property assumed in the definition itself or fitted from data. No equations in the provided text reduce the claimed retrieval property to a tautology or to a prior self-citation that itself lacks independent verification. The hybrid architecture description, training pipeline, and RL stage are separate from this proof. The continuous-limit assumption is stated openly and does not create a self-referential loop. Per the rules, absent a quotable reduction of the target claim to its own inputs by construction, the verdict is no significant circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- [ad hoc to paper] The SCA readout mechanism exactly retrieves any token from the prefix summary in the continuous limit
Invented entities (1)
- SeqCond Attention (SCA): no independent evidence
Reference graph
Works this paper leans on
- [1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023.
- [2] Maixent Chenebaux and Tristan Cazenave. SeqCondenser: Inductive representation learning of sequences by sampling characteristic functions. In Text, Speech, and Dialogue (TSD 2024), volume 15048 of Lecture Notes in Computer Science, pages 1–12. Springer, 2024.
- [3] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [4] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [5] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022.
- [6] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [7] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024.
- [8] Wenlong Liu et al. Embarrassingly simple self-distillation improves code generation. arXiv preprint arXiv:2604.01193, 2025.
- [9] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Fan, Tianyu Liu, Ruoxi Zheng, Hang Luo, Wai-kin Lam, Siamak Rajmohan, Qing Zhang, et al. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
- [10] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024.
- [11] PleIAs. SYNTH: A large-scale synthetic reasoning dataset. https://huggingface.co/datasets/PleIAs/SYNTH, 2024. Hugging Face dataset.
- [12] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [13] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- [14] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- [15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [16] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022.
- [17] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, volume 35, pages 15476–15488, 2022.
- [18] Biao Zhang and Rico Sennrich. Root mean square layer normalization. arXiv preprint arXiv:1910.07467, 2019.