C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
Pith reviewed 2026-05-10 12:59 UTC · model grok-4.3
The pith
C-voting selects the most confident latent trajectory by averaging top-1 probabilities, enabling test-time scaling for recurrent models without energy functions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
C-voting generates several candidate latent trajectories through random initialization and selects the trajectory that maximizes the average of the top-1 probabilities across the output sequence. This confidence measure identifies the most reliable reasoning path and works for any recurrent model that produces multiple candidate trajectories, without depending on explicit energy functions. When paired with the newly introduced ItrSA++ model, it delivers 95.2 percent accuracy on Sudoku-extreme and 78.6 percent on Maze, surpassing the Hierarchical Reasoning Model.
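As a concrete reading of the mechanism, the sketch below walks through the whole loop in Python: K random initial latent states, T applications of the same recurrent update, a readout to per-position logits, and C-voting selection. The toy cell, every dimension, and all names are illustrative assumptions, not the paper's implementation; only the final selection rule follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): K candidate trajectories, T recurrent
# steps, L output positions (81 cells of a 9x9 Sudoku board), V output
# symbols, D latent width. None of these values come from the paper.
K, T, L, V, D = 8, 16, 81, 10, 32

# Stand-ins for a trained recurrent model: one weight matrix applied
# recursively to the latent state, plus a linear readout to logits.
W_rec = rng.normal(scale=D ** -0.5, size=(D, D))
W_out = rng.normal(scale=D ** -0.5, size=(D, V))

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def run_trajectory(z0, steps):
    """Apply the same recurrent update `steps` times (test-time scaling
    knob #1), then read out per-position logits."""
    z = z0
    for _ in range(steps):
        z = np.tanh(z @ W_rec)
    return z @ W_out                                # (L, V) logits

# Test-time scaling knob #2: K random initial latent states.
inits = rng.normal(size=(K, L, D))
logits = np.stack([run_trajectory(z0, T) for z0 in inits])   # (K, L, V)

# C-voting: average the top-1 probability over the output sequence and
# keep the most confident trajectory.
confidence = softmax(logits).max(axis=-1).mean(axis=-1)      # (K,)
best = int(confidence.argmax())
prediction = logits[best].argmax(axis=-1)                    # (L,) symbols
print(f"selected trajectory {best} with confidence {confidence[best]:.3f}")
```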
What carries the argument
C-voting: the selection of the latent trajectory that maximizes the average top-1 prediction probability across the output sequence.
If this is right
- C-voting yields 4.9 percent higher accuracy on Sudoku-hard than energy-based voting.
- Combined with ItrSA++, it reaches 95.2 percent accuracy on Sudoku-extreme compared to 55.0 percent for the Hierarchical Reasoning Model.
- It reaches 78.6 percent accuracy on Maze compared to 74.5 percent for the Hierarchical Reasoning Model.
- It applies directly to recurrent models that lack explicit energy functions.
- It supports test-time scaling by increasing recurrent steps and using multiple random initializations.
Where Pith is reading between the lines
- This selection rule might apply to other iterative models that produce multiple possible output paths even if they are not strictly recurrent.
- Prediction confidence could serve as a substitute signal for correctness in a wider range of sequence generation tasks where energy is unavailable.
- Increasing the number of candidate initializations beyond the values tested here may produce further gains on harder instances of the same tasks.
Load-bearing premise
The average of top-1 prediction probabilities across the output sequence reliably indicates which latent trajectory contains the correct reasoning path.
What would settle it
A set of Sudoku or Maze instances where the trajectory with the highest average top-1 probability produces an incorrect solution while a lower-confidence trajectory produces the correct one.
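Such a counterexample search could be automated along the following lines; a minimal sketch, assuming hypothetical arrays of per-candidate confidences and correctness flags are already in hand:

```python
import numpy as np

def confidence_counterexamples(confidences, correct):
    """Find instances where C-voting's pick is wrong even though a
    lower-confidence candidate is right.

    confidences: (N, K) sequence-averaged top-1 probabilities for
                 K candidate trajectories on each of N instances.
    correct:     (N, K) booleans, True where a candidate's decoded
                 output solves the instance.
    Returns the indices of counterexample instances."""
    n = confidences.shape[0]
    picked = confidences.argmax(axis=1)                      # C-voting's choice
    picked_wrong = ~correct[np.arange(n), picked]            # chosen one fails
    some_right = correct.any(axis=1)                         # another succeeds
    return np.flatnonzero(picked_wrong & some_right)
```

The size of the returned set, relative to the instances where any candidate is correct, measures how far C-voting falls short of an oracle selector.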
read the original abstract
Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model's confidence. Additionally, it yields 4.9% higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%) tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes C-voting, a test-time scaling method for recurrent latent models that initializes multiple candidate trajectories from random variables and selects the trajectory maximizing the sequence-averaged top-1 prediction probability. It claims this yields a 4.9% accuracy improvement over energy-based voting on Sudoku-hard and, when combined with the introduced ItrSA++ model, achieves 95.2% vs. 55.0% on Sudoku-extreme and 78.6% vs. 74.5% on Maze relative to HRM, while remaining applicable to models lacking explicit energy functions.
Significance. If the central assumption holds, C-voting would provide a simple, energy-function-free alternative for test-time scaling in recurrent reasoning models, broadening applicability beyond specialized architectures like HRM or AKOrN and enabling gains on hard combinatorial tasks without additional training.
major comments (2)
- [Abstract] The reported performance deltas (4.9% on Sudoku-hard; 95.2% vs. 55.0% on Sudoku-extreme; 78.6% vs. 74.5% on Maze) are presented without any correlation analysis, calibration check, or ablation demonstrating that higher average top-1 probabilities reliably select the correct trajectory rather than an overconfident incorrect one; this assumption is load-bearing for all accuracy claims.
- [Abstract] Neither the abstract nor the experimental description supplies the number of runs, statistical significance tests, baseline implementation details, or data splits underlying the accuracy figures, preventing assessment of whether the gains are robust or reproducible.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and outline revisions to improve clarity and rigor.
read point-by-point responses
- Referee: [Abstract] The reported performance deltas (4.9% on Sudoku-hard; 95.2% vs. 55.0% on Sudoku-extreme; 78.6% vs. 74.5% on Maze) are presented without any correlation analysis, calibration check, or ablation demonstrating that higher average top-1 probabilities reliably select the correct trajectory rather than an overconfident incorrect one; this assumption is load-bearing for all accuracy claims.
Authors: We agree that direct validation of the selection criterion strengthens the claims. The manuscript presents the performance gains as supporting evidence for C-voting, but we will add an ablation study and correlation analysis in the revised version. This will include plots showing the relationship between average top-1 probability and trajectory correctness, as well as comparisons of C-voting against random selection and alternative metrics to confirm it does not favor overconfident errors. revision: yes
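A correlation analysis of the kind promised here could look like the sketch below, assuming flattened arrays of per-candidate confidences and correctness flags; the data-generating line is a toy placeholder, not the paper's results:

```python
import numpy as np

# Hypothetical flattened data: one confidence score and one correctness
# flag per (instance, candidate) pair. Toy generator only.
rng = np.random.default_rng(2)
confidence = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < confidence   # toy: correctness tracks confidence

# Point-biserial correlation equals the Pearson correlation between the
# binary correctness flag and the continuous confidence score.
r = np.corrcoef(confidence, correct.astype(float))[0, 1]

# Calibration-style summary: empirical accuracy within confidence bins.
bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print(f"confidence [{lo:.1f}, {hi:.1f}): accuracy {correct[mask].mean():.2f}")
print(f"point-biserial r = {r:.3f}")
```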
- Referee: [Abstract] Neither the abstract nor the experimental description supplies the number of runs, statistical significance tests, baseline implementation details, or data splits underlying the accuracy figures, preventing assessment of whether the gains are robust or reproducible.
Authors: We apologize for the lack of explicit detail in the abstract and experimental sections. We will revise the manuscript to include: results averaged over 5 independent runs with different random seeds, reported with standard deviations; paired t-tests for the statistical significance of the reported gains; re-implementations of baselines (HRM, energy-based voting) following the original hyperparameters and code where available; and standard data splits from prior work on Sudoku and Maze. A new reproducibility subsection will be added. revision: yes
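A sketch of the promised paired test, assuming per-seed board accuracies are available; the ten numbers below are placeholders, not results from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed board accuracies over 5 runs: C-voting vs.
# energy-based voting on the same seeds. Placeholder values only.
c_voting = np.array([0.950, 0.953, 0.949, 0.955, 0.951])
energy   = np.array([0.902, 0.905, 0.899, 0.904, 0.903])

# Paired t-test: each seed yields one matched pair of accuracies.
t_stat, p_value = stats.ttest_rel(c_voting, energy)
print(f"mean gain {100 * (c_voting - energy).mean():.1f} pts, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```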
Circularity Check
No circularity: C-voting is a direct definition with empirical validation
full rationale
The paper defines C-voting explicitly as selecting the latent trajectory that maximizes the sequence-averaged top-1 prediction probability. This is a straightforward algorithmic rule, not a derivation that reduces to its own inputs by construction. Performance claims (e.g., 4.9% gain over energy-based voting, outperformance vs. HRM) are presented as empirical outcomes on Sudoku and Maze benchmarks rather than as fitted parameters or self-referential predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the core method. The assumption that higher average top-1 probability tracks correctness is an unproven hypothesis but does not create circularity in the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Higher average top-1 probability indicates a higher-quality latent trajectory in recurrent reasoning models.
invented entities (2)
- C-voting: no independent evidence
- ItrSA++: no independent evidence
Reference graph
Works this paper leans on
- [1] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
- [2] François Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831, 2025.
- [3] Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous thought machines. arXiv preprint arXiv:2505.05522, 2025.
- [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- [5] Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, et al. Scaling exponents across parameterizations and optimizers. arXiv preprint arXiv:2407.05872, 2024.
- [6] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. CoRR, abs/2502.05171, February 2025. URL https://doi.org/10.48550/arXiv.2502.05171.
- [7] Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, and Tariq Iqbal. Energy-based transformers are scalable learners and thinkers. arXiv preprint arXiv:2507.02092, 2025. URL https://arxiv.org/abs/2507.02092.
- [8] Yunzhe Hu, Difan Zou, and Dong Xu. Hyper-SET: Designing transformers via hyperspherical energy minimization. arXiv [cs.LG].
- [9] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. arXiv preprint arXiv:2312.02696, 2023.
- [10] Hojoon Lee, Hyeonseo Cho, Hyunseung Kim, Donghu Kim, Dugki Min, Jaegul Choo, and Clare Lyle. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks. arXiv, abs/2406.02596, 2024.
- [11] Siyuan Li, Zicheng Liu, Juanxi Tian, Ge Wang, Zedong Wang, Weiyang Jin, Di Wu, Cheng Tan, Tao Lin, Yang Liu, Baigui Sun, and Stan Z. Li. Switch EMA: A free lunch for better flatness and sharpness. arXiv, abs/2402.09240, 2024.
- [12] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416, 2025.
- [13] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- [14] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pp. 6545–6554. PMLR, 2019.
- [15] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. …
- [16] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
- [17] Wanrong Zhu, Zhiting Hu, and Eric Xing. Text infilling. arXiv preprint arXiv:1901.00158, 2019.
- [18] Appendix excerpt, test-time scaling in ItrSA++: ItrSA++ has recursive structures similar to models such as the recurrent transformer (Geiping et al., 2025; Jaegle et al., 2021), and test-time scaling can be observed. Figure 8 demonstrates that for Sudoku-hard, Sudoku-extreme, and Maze-hard tasks, board accuracy increases as the number of iterative steps grows in ItrSA++. [Figure 8: Test-time scaling in ItrSA++; panels for Sudoku-hard (9x9), Sudoku-extreme (9x9), and Maze-hard (30x30), plotting board accuracy (%) against the number of iterative steps T.]
- [19] Appendix excerpt, training detail for ItrSA++: during training, the gradient of the latent state z_t is detached at t = 2 for Sudoku and at t = 14 for … [Figure 9: Board accuracy (%) of a transformer with C-voting on Sudoku-hard (9x9) versus the number of random samples, from 1 to 4096.]
- [20] Appendix excerpt, Table 2 (hyperparameters for modified HRM): optimizer Adam; β for Adam (0.9, 0.95); weight decay 1.0; gradient clipping threshold 1.0; learning rate 1×10⁻⁴; warm-up steps 2000; batch size 768; number of heads 8; embedding dimension 512; epochs 20000; number of H layers 4; number of L layers 4; halt exploration probability 1.0.
- [21] Appendix excerpt, choice of confidence metric: for Sudoku-extreme, almost no difference is observed between metrics, and even on Maze-hard the difference is only about 0.1%. This is thought to be because when the top-1 probability is dominant, there is little difference in ranking across metrics. Since it also facilitates analyses such as Equation 14, the top-1 probability is adopted.
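The note in [21] claims that once the top-1 probability dominates, the ranking of trajectories barely changes across confidence metrics. A minimal sketch of that comparison; the metric set (top-1 probability, margin, negative entropy) and the sharpness scale are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rank_trajectories(candidate_logits, metric):
    """Order K candidate trajectories (shape (K, L, V)) by a
    sequence-averaged confidence score, most confident first."""
    p = softmax(candidate_logits)
    if metric == "top1":
        score = p.max(axis=-1)
    elif metric == "margin":                  # top-1 minus top-2 probability
        s = np.sort(p, axis=-1)
        score = s[..., -1] - s[..., -2]
    elif metric == "neg_entropy":
        score = (p * np.log(p + 1e-12)).sum(axis=-1)
    else:
        raise ValueError(metric)
    return np.argsort(-score.mean(axis=-1))

# With sharply peaked predictions (large logit scale), the orderings
# from the three metrics tend to coincide; with diffuse predictions
# they can diverge.
rng = np.random.default_rng(1)
sharp = 8.0 * rng.normal(size=(4, 81, 10))
for m in ("top1", "margin", "neg_entropy"):
    print(m, rank_trajectories(sharp, m))
```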