C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
Pith reviewed 2026-05-10 12:59 UTC · model grok-4.3
The pith
C-voting selects the most confident latent trajectory by averaging top-1 probabilities, enabling test-time scaling for recurrent models without energy functions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
C-voting generates several candidate latent trajectories through random initialization and selects the trajectory that maximizes the average of the top-1 probabilities across the output sequence. This confidence measure identifies the most reliable reasoning path and works for any recurrent model that produces multiple candidate trajectories, without depending on explicit energy functions. When paired with the newly introduced ItrSA++ model, it delivers 95.2 percent accuracy on Sudoku-extreme and 78.6 percent on Maze, surpassing the Hierarchical Reasoning Model.
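As a concrete reading of the mechanism, the sketch below walks through the whole loop in Python: K random initial latent states, T applications of the same recurrent update, a readout to per-position logits, and C-voting selection. The toy cell, every dimension, and all names are illustrative assumptions, not the paper's implementation; only the final selection rule follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): K candidate trajectories, T recurrent
# steps, L output positions (81 cells of a 9x9 Sudoku board), V output
# symbols, D latent width. None of these values come from the paper.
K, T, L, V, D = 8, 16, 81, 10, 32

# Stand-ins for a trained recurrent model: one weight matrix applied
# recursively to the latent state, plus a linear readout to logits.
W_rec = rng.normal(scale=D ** -0.5, size=(D, D))
W_out = rng.normal(scale=D ** -0.5, size=(D, V))

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def run_trajectory(z0, steps):
    """Apply the same recurrent update `steps` times (test-time scaling
    knob #1), then read out per-position logits."""
    z = z0
    for _ in range(steps):
        z = np.tanh(z @ W_rec)
    return z @ W_out                                # (L, V) logits

# Test-time scaling knob #2: K random initial latent states.
inits = rng.normal(size=(K, L, D))
logits = np.stack([run_trajectory(z0, T) for z0 in inits])   # (K, L, V)

# C-voting: average the top-1 probability over the output sequence and
# keep the most confident trajectory.
confidence = softmax(logits).max(axis=-1).mean(axis=-1)      # (K,)
best = int(confidence.argmax())
prediction = logits[best].argmax(axis=-1)                    # (L,) symbols
print(f"selected trajectory {best} with confidence {confidence[best]:.3f}")
```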
What carries the argument
C-voting: the selection of the latent trajectory that maximizes the average top-1 prediction probability across the output sequence.
If this is right
- C-voting yields 4.9 percent higher accuracy on Sudoku-hard than energy-based voting.
- Combined with ItrSA++, it reaches 95.2 percent accuracy on Sudoku-extreme compared to 55.0 percent for the Hierarchical Reasoning Model.
- It reaches 78.6 percent accuracy on Maze compared to 74.5 percent for the Hierarchical Reasoning Model.
- It applies directly to recurrent models that lack explicit energy functions.
- It supports test-time scaling by increasing recurrent steps and using multiple random initializations.
Where Pith is reading between the lines
- This selection rule might apply to other iterative models that produce multiple possible output paths even if they are not strictly recurrent.
- Prediction confidence could serve as a substitute signal for correctness in a wider range of sequence generation tasks where energy is unavailable.
- Increasing the number of candidate initializations beyond the values tested here may produce further gains on harder instances of the same tasks.
Load-bearing premise
The average of top-1 prediction probabilities across the output sequence reliably indicates which latent trajectory contains the correct reasoning path.
What would settle it
A set of Sudoku or Maze instances where the trajectory with the highest average top-1 probability produces an incorrect solution while a lower-confidence trajectory produces the correct one.
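Such a counterexample search could be automated along the following lines; a minimal sketch, assuming hypothetical arrays of per-candidate confidences and correctness flags are already in hand:

```python
import numpy as np

def confidence_counterexamples(confidences, correct):
    """Find instances where C-voting's pick is wrong even though a
    lower-confidence candidate is right.

    confidences: (N, K) sequence-averaged top-1 probabilities for
                 K candidate trajectories on each of N instances.
    correct:     (N, K) booleans, True where a candidate's decoded
                 output solves the instance.
    Returns the indices of counterexample instances."""
    n = confidences.shape[0]
    picked = confidences.argmax(axis=1)                      # C-voting's choice
    picked_wrong = ~correct[np.arange(n), picked]            # chosen one fails
    some_right = correct.any(axis=1)                         # another succeeds
    return np.flatnonzero(picked_wrong & some_right)
```

The size of the returned set, relative to the instances where any candidate is correct, measures how far C-voting falls short of an oracle selector.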
read the original abstract
Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model's confidence. Additionally, it yields 4.9% higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%) tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes C-voting, a test-time scaling method for recurrent latent models that initializes multiple candidate trajectories from random variables and selects the trajectory maximizing the sequence-averaged top-1 prediction probability. It claims this yields a 4.9% accuracy improvement over energy-based voting on Sudoku-hard and, when combined with the introduced ItrSA++ model, achieves 95.2% vs. 55.0% on Sudoku-extreme and 78.6% vs. 74.5% on Maze relative to HRM, while remaining applicable to models lacking explicit energy functions.
Significance. If the central assumption holds, C-voting would provide a simple, energy-function-free alternative for test-time scaling in recurrent reasoning models, broadening applicability beyond specialized architectures like HRM or AKOrN and enabling gains on hard combinatorial tasks without additional training.
major comments (2)
- [Abstract] The reported performance deltas (4.9% on Sudoku-hard; 95.2% vs. 55.0% on Sudoku-extreme; 78.6% vs. 74.5% on Maze) are presented without any correlation analysis, calibration check, or ablation demonstrating that higher average top-1 probabilities reliably select the correct trajectory rather than an overconfident incorrect one; this assumption is load-bearing for all accuracy claims.
- [Abstract] Neither the abstract nor the experimental description supplies the number of runs, statistical significance tests, baseline implementation details, or data splits underlying the accuracy figures, preventing assessment of whether the gains are robust or reproducible.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and outline revisions to improve clarity and rigor.
read point-by-point responses
- Referee: [Abstract] The reported performance deltas (4.9% on Sudoku-hard; 95.2% vs. 55.0% on Sudoku-extreme; 78.6% vs. 74.5% on Maze) are presented without any correlation analysis, calibration check, or ablation demonstrating that higher average top-1 probabilities reliably select the correct trajectory rather than an overconfident incorrect one; this assumption is load-bearing for all accuracy claims.
Authors: We agree that direct validation of the selection criterion strengthens the claims. The manuscript presents the performance gains as supporting evidence for C-voting, but we will add an ablation study and correlation analysis in the revised version. This will include plots showing the relationship between average top-1 probability and trajectory correctness, as well as comparisons of C-voting against random selection and alternative metrics to confirm it does not favor overconfident errors. revision: yes
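A correlation analysis of the kind promised here could look like the sketch below, assuming flattened arrays of per-candidate confidences and correctness flags; the data-generating line is a toy placeholder, not the paper's results:

```python
import numpy as np

# Hypothetical flattened data: one confidence score and one correctness
# flag per (instance, candidate) pair. Toy generator only.
rng = np.random.default_rng(2)
confidence = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < confidence   # toy: correctness tracks confidence

# Point-biserial correlation equals the Pearson correlation between the
# binary correctness flag and the continuous confidence score.
r = np.corrcoef(confidence, correct.astype(float))[0, 1]

# Calibration-style summary: empirical accuracy within confidence bins.
bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print(f"confidence [{lo:.1f}, {hi:.1f}): accuracy {correct[mask].mean():.2f}")
print(f"point-biserial r = {r:.3f}")
```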
- Referee: [Abstract] Neither the abstract nor the experimental description supplies the number of runs, statistical significance tests, baseline implementation details, or data splits underlying the accuracy figures, preventing assessment of whether the gains are robust or reproducible.
Authors: We apologize for the lack of explicit detail in the abstract and experimental sections. We will revise the manuscript to include: results averaged over 5 independent runs with different random seeds, reported with standard deviations; paired t-tests for the statistical significance of the reported gains; re-implementations of baselines (HRM, energy-based voting) following the original hyperparameters and code where available; and standard data splits from prior work on Sudoku and Maze. A new reproducibility subsection will be added. revision: yes
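A sketch of the promised paired test, assuming per-seed board accuracies are available; the ten numbers below are placeholders, not results from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed board accuracies over 5 runs: C-voting vs.
# energy-based voting on the same seeds. Placeholder values only.
c_voting = np.array([0.950, 0.953, 0.949, 0.955, 0.951])
energy   = np.array([0.902, 0.905, 0.899, 0.904, 0.903])

# Paired t-test: each seed yields one matched pair of accuracies.
t_stat, p_value = stats.ttest_rel(c_voting, energy)
print(f"mean gain {100 * (c_voting - energy).mean():.1f} pts, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```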
Circularity Check
No circularity: C-voting is a direct definition with empirical validation
full rationale
The paper defines C-voting explicitly as selecting the latent trajectory that maximizes the sequence-averaged top-1 prediction probability. This is a straightforward algorithmic rule, not a derivation that reduces to its own inputs by construction. Performance claims (e.g., 4.9% gain over energy-based voting, outperformance vs. HRM) are presented as empirical outcomes on Sudoku and Maze benchmarks rather than as fitted parameters or self-referential predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the core method. The assumption that higher average top-1 probability tracks correctness is an unproven hypothesis but does not create circularity in the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Higher average top-1 probability indicates a higher-quality latent trajectory in recurrent reasoning models.
invented entities (2)
- C-voting: no independent evidence
- ItrSA++: no independent evidence
Reference graph
Works this paper leans on
- [1] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
- [2] François Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831, 2025.
- [3] Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous thought machines. arXiv preprint arXiv:2505.05522, 2025.
- [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- [5] Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, et al. Scaling exponents across parameterizations and optimizers. arXiv preprint arXiv:2407.05872, 2024.
- [6] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. CoRR, abs/2502.05171, February 2025. URL https://doi.org/10.48550/arXiv.2502.05171.
- [7] Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, and Tariq Iqbal. Energy-based transformers are scalable learners and thinkers. arXiv preprint arXiv:2507.02092, 2025. URL https://arxiv.org/abs/2507.02092.
- [8] Yunzhe Hu, Difan Zou, and Dong Xu. Hyper-SET: Designing transformers via hyperspherical energy minimization. arXiv [cs.LG].
- [9] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. arXiv preprint arXiv:2312.02696, 2023.
- [10] Hojoon Lee, Hyeonseo Cho, Hyunseung Kim, Donghu Kim, Dugki Min, Jaegul Choo, and Clare Lyle. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks. arXiv, abs/2406.02596, 2024.
- [11] Siyuan Li, Zicheng Liu, Juanxi Tian, Ge Wang, Zedong Wang, Weiyang Jin, Di Wu, Cheng Tan, Tao Lin, Yang Liu, Baigui Sun, and Stan Z. Li. Switch EMA: A free lunch for better flatness and sharpness. arXiv, abs/2402.09240, 2024.
- [12] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416, 2025.
- [13] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- [14] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pp. 6545–6554. PMLR, 2019.
- [15] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. …
- [16] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
- [17] Wanrong Zhu, Zhiting Hu, and Eric Xing. Text infilling. arXiv preprint arXiv:1901.00158, 2019.
- [18] Appendix excerpt, test-time scaling in ItrSA++: ItrSA++ has recursive structures similar to models such as the recurrent transformer (Geiping et al., 2025; Jaegle et al., 2021), and test-time scaling can be observed. Figure 8 demonstrates that for Sudoku-hard, Sudoku-extreme, and Maze-hard tasks, board accuracy increases as the number of iterative steps grows in ItrSA++. [Figure 8: Test-time scaling in ItrSA++; panels for Sudoku-hard (9x9), Sudoku-extreme (9x9), and Maze-hard (30x30), plotting board accuracy (%) against the number of iterative steps T.]
- [19] Appendix excerpt, training detail for ItrSA++: during training, the gradient of the latent state z_t is detached at t = 2 for Sudoku and at t = 14 for … [Figure 9: Board accuracy (%) of a transformer with C-voting on Sudoku-hard (9x9) versus the number of random samples, from 1 to 4096.]
- [20] Appendix excerpt, Table 2 (hyperparameters for modified HRM): optimizer Adam; β for Adam (0.9, 0.95); weight decay 1.0; gradient clipping threshold 1.0; learning rate 1×10⁻⁴; warm-up steps 2000; batch size 768; number of heads 8; embedding dimension 512; epochs 20000; number of H layers 4; number of L layers 4; halt exploration probability 1.0.
- [21] Appendix excerpt, choice of confidence metric: for Sudoku-extreme, almost no difference is observed between metrics, and even on Maze-hard the difference is only about 0.1%. This is thought to be because when the top-1 probability is dominant, there is little difference in ranking across metrics. Since it also facilitates analyses such as Equation 14, the top-1 probability is adopted.
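The note in [21] claims that once the top-1 probability dominates, the ranking of trajectories barely changes across confidence metrics. A minimal sketch of that comparison; the metric set (top-1 probability, margin, negative entropy) and the sharpness scale are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rank_trajectories(candidate_logits, metric):
    """Order K candidate trajectories (shape (K, L, V)) by a
    sequence-averaged confidence score, most confident first."""
    p = softmax(candidate_logits)
    if metric == "top1":
        score = p.max(axis=-1)
    elif metric == "margin":                  # top-1 minus top-2 probability
        s = np.sort(p, axis=-1)
        score = s[..., -1] - s[..., -2]
    elif metric == "neg_entropy":
        score = (p * np.log(p + 1e-12)).sum(axis=-1)
    else:
        raise ValueError(metric)
    return np.argsort(-score.mean(axis=-1))

# With sharply peaked predictions (large logit scale), the orderings
# from the three metrics tend to coincide; with diffuse predictions
# they can diverge.
rng = np.random.default_rng(1)
sharp = 8.0 * rng.normal(size=(4, 81, 10))
for m in ("top1", "margin", "neg_entropy"):
    print(m, rank_trajectories(sharp, m))
```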