Does the Same Token Mean the Same State? MoE Routing as Signal for Reasoning Control

Junjie Nian; Kang Chen; Minshen Yu; Yaoning Wang; Yixin Cao; Yugang Jiang

arxiv: 2606.22798 · v1 · pith:IRWAUJQOnew · submitted 2026-06-22 · 💻 cs.CL

Does the Same Token Mean the Same State? MoE Routing as Signal for Reasoning Control

Kang Chen , Minshen Yu , Junjie Nian , Yaoning Wang , Yixin Cao , Yugang Jiang This is my paper

Pith reviewed 2026-06-26 08:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords mixture of expertsrouting statesreasoning controlanswer selectiontest-time decodingboundary anchorsdelimiter anchorsweighted jaccard

0 comments

The pith

The same token in MoE models activates different experts based on context, so routing states at anchors can select correct reasoning paths without reading the answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse MoE language models show that fixing the token id at anchors does not fix the router state. The experts still encode task context and reasoning mode. This residual structure lets routing neighborhoods at boundary and delimiter anchors align with final-answer basins. RAD operationalizes this by selecting the rollout whose anchor-window routing is the densest center in Weighted-Jaccard K-NN space. It matches majority voting performance while working on tasks where answer strings cannot be voted on.

Core claim

Holding the emitted token id fixed at repeated anchors, the experts that produce it still separate task context, trajectory history, and reasoning-effort mode. Near boundary anchors and delimiter anchors, routing neighborhoods already align with final-answer basins at a marker-only readout, strongest when read at the answer opening.

What carries the argument

RAD (Routing Agreement Decoding): locate a fixed anchor, represent each rollout by anchor-window MoE routing states, return the densest Weighted-Jaccard K-NN route-basin center.

If this is right

RAD performs on par with majority voting (73.9 vs 73.6) across 10 MoE configs and 6 datasets without using answer strings.
It provides a direct pass@1 selector for code generation where exact-string voting is ill-defined.
Re-anchoring the routing-density principle to the agentic boundary improves best-of-16 patch selection on SWE-bench Verified over random.
RAD is not a verifier and can still select a dense wrong basin.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Routing states might allow internal monitoring of reasoning effort without external tools.
The approach could extend to selecting among multiple agent trajectories or plans in multi-step tasks.
Testing RAD on non-MoE models or other routing mechanisms would check if the signal is specific to sparse experts.

Load-bearing premise

MoE routing states observed at a fixed anchor window are stable and task-discriminative enough to identify the correct answer basin without any access to the generated token sequence or external verification.

What would settle it

Finding a set of rollouts where the routing-based selector picks the wrong basin more frequently than string majority voting across the tested datasets would show the alignment does not hold.

read the original abstract

In sparse Mixture-of-Experts language models, does the same token id imply the same router state and the same experts producing it? Holding the emitted token id fixed at repeated anchors, we find it does not: the experts that produce it still separate task context, trajectory history, and reasoning-effort mode. This residual structure supports test-time control: near \emph{boundary} anchors (the final-response transition) and \emph{delimiter} anchors (which open the answer, e.g.\ \texttt{\textbackslash boxed\{} or code fences), routing neighborhoods already align with final-answer basins at a marker-only readout and strongest when the routing is read at the answer opening. We operationalize this as \textbf{RAD} (Routing Agreement Decoding), an answer-string-free multi-rollout selector: it locates a fixed anchor, represents each rollout by its anchor-window MoE routing states, and returns the densest Weighted-Jaccard $K$-NN route-basin center, without parsing, normalizing, executing, or voting over answer strings. Across 10 sparse-MoE configurations (gpt-oss, Qwen3-MoE) and 6 datasets spanning math, GPQA, and code, RAD is on par with Majority where string voting is well-posed, with small positive paired deltas (RAD $73.9$ / RAD+DC $74.2$ vs.\ Majority $73.6$). Like majority voting, RAD is not a verifier: a dense \emph{wrong} basin can still win. Its value is the interface: the same selector gives direct pass@1 on code, where exact-string voting is ill-defined, and the same routing-density principle, re-anchored to the agentic boundary, improves best-of-16 patch selection on SWE-bench Verified over random, where patches have no answer string to vote on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MoE routing vectors at fixed answer anchors give a workable string-free rollout selector that matches majority voting on average, with the main value being its use on code and patches where strings can't be voted.

read the letter

The core observation is that in sparse MoE models the router state at a repeated token is not fixed; it still encodes task context and reasoning mode. The authors turn this into RAD, which reads routing neighborhoods only at boundary or delimiter anchors and picks the densest Weighted-Jaccard basin across rollouts. No answer text is parsed or compared.

What stands out is the practical interface. On math, GPQA, and code tasks across ten MoE setups it lands at 73.9 (RAD) and 74.2 (RAD+DC) against 73.6 for majority, and the same density rule improves best-of-16 patch selection on SWE-bench over random. That is the part worth noticing: a selector that works when exact-string voting is ill-defined.

The gains are small and the method can still select a dense wrong basin, which the abstract states plainly. The abstract gives aggregate numbers but no variance, no statistical tests, and no detail on whether anchor windows were chosen after seeing results. If routing states shift substantially across rollouts the K-NN step could be selecting noise rather than signal; the paper would need to show stability metrics to make the claim solid.

This is for groups already running multi-rollout inference on MoE models and looking for cheap ways to filter outputs in code or agent settings. The experiments are broad enough and the idea is concrete enough that it should go to referees rather than desk rejection, even though the effect size is modest and further validation on routing stability would strengthen it.

Referee Report

2 major / 1 minor

Summary. The paper claims that in sparse MoE language models the same token ID does not imply the same router state: routing at repeated boundary and delimiter anchors separates task context, trajectory history, and reasoning-effort mode. It introduces RAD (Routing Agreement Decoding), a string-free multi-rollout selector that represents each rollout by its anchor-window routing states and returns the densest Weighted-Jaccard K-NN route-basin center; across 10 MoE configurations and 6 datasets RAD reports 73.9 (RAD+DC 74.2) versus Majority 73.6 and extends the same principle to code and SWE-bench patch selection.

Significance. If the routing states at fixed anchors prove stable and sufficiently discriminative, RAD supplies a practical interface for test-time control that does not require answer-string parsing or voting, which is directly useful for code generation and agentic settings where exact-string majority is ill-defined. The multi-configuration, multi-dataset evaluation is a concrete strength.

major comments (2)

[Abstract] Abstract: the aggregate claim of small positive deltas (RAD 73.9 / RAD+DC 74.2 vs. Majority 73.6) across 10 configurations and 6 datasets supplies no variance, statistical tests, data-split details, or confirmation that anchor selection was pre-specified rather than post-hoc; this information is load-bearing for the assertion that routing neighborhoods already align with correct final-answer basins at a marker-only readout.
[RAD definition and experimental section] RAD definition and experimental section: the selector is defined directly from the observed routing vectors and their Weighted-Jaccard density; the manuscript does not report independent metrics of cluster stability across rollouts or correlation between routing basins and correctness independent of the final answer string, leaving the central assumption that fixed-anchor states are task-discriminative without token access unverified.

minor comments (1)

[Method] Notation for the Weighted-Jaccard K-NN distance and the precise window size around boundary/delimiter anchors should be stated explicitly in the method section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, indicating where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: the aggregate claim of small positive deltas (RAD 73.9 / RAD+DC 74.2 vs. Majority 73.6) across 10 configurations and 6 datasets supplies no variance, statistical tests, data-split details, or confirmation that anchor selection was pre-specified rather than post-hoc; this information is load-bearing for the assertion that routing neighborhoods already align with correct final-answer basins at a marker-only readout.

Authors: The experimental section of the manuscript already reports per-configuration and per-dataset breakdowns together with the exact data splits and model configurations used. Anchor selection (boundary and delimiter tokens) was fixed in advance on the basis of earlier pilot observations of MoE routing behavior and was not tuned on the final test sets. Nevertheless, the abstract itself presents only aggregate figures. In the revision we will (i) add a parenthetical note on variance and the paired statistical tests that were performed, (ii) explicitly state that anchor positions were pre-specified, and (iii) reference the supplementary tables that contain the full per-run statistics. revision: partial
Referee: [RAD definition and experimental section] RAD definition and experimental section: the selector is defined directly from the observed routing vectors and their Weighted-Jaccard density; the manuscript does not report independent metrics of cluster stability across rollouts or correlation between routing basins and correctness independent of the final answer string, leaving the central assumption that fixed-anchor states are task-discriminative without token access unverified.

Authors: The primary evidence offered is that RAD, which uses only routing states at fixed anchors, matches or slightly exceeds string-based majority voting across ten model configurations and six tasks. This performance parity supplies indirect support for the claim that routing neighborhoods align with answer correctness. We agree, however, that direct, answer-string-independent diagnostics would strengthen the argument. In the revised experimental section we will therefore add (a) intra- and inter-cluster similarity statistics on the routing vectors themselves and (b) a correlation analysis between basin density and correctness computed after removing any reference to the generated answer strings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; RAD is an empirical definition from observed routing vectors

full rationale

The paper reports an empirical observation that routing states at fixed anchors separate task context/trajectory/mode, then directly defines RAD as the densest Weighted-Jaccard K-NN center in that routing space. No equations, fitted parameters, or self-citations reduce the selector to its own inputs by construction. The method is presented as an operationalization of the observed alignment, with performance compared to majority voting on external datasets. This matches the default expectation of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method rests on the empirical observation that routing states separate context.

pith-pipeline@v0.9.1-grok · 5891 in / 1145 out tokens · 19832 ms · 2026-06-26T08:57:17.707972+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 3 canonical work pages

[1]

Modelutilitylaw: Evaluating LLMs beyond performance through mechanism interpretable metric

YixinCao,JiahaoYing,YaoningWang,XipengQiu,XuanjingHuang,andYugangJiang. Modelutilitylaw: Evaluating LLMs beyond performance through mechanism interpretable metric. arXiv preprint arXiv:2504.07440, 2025. URL https://arxiv.org/abs/2504.07440

arXiv 2025
[2]

Do LLMs signal when they’re right? evidence from neuron agreement, 2025

Kang Chen, Yaoning Wang, Kai Xiong, Zhuoka Feng, Wenhe Sun, Haotian Chen, and Yixin Cao. Do LLMs signal when they’re right? evidence from neuron agreement, 2025. URLhttps://arxiv.org/abs/2510.26277

arXiv 2025
[3]

NEX: Neuron explore-exploit scoring for label-free chain-of-thought selection and model ranking

Kang Chen, Zhuoka Feng, Sihan Zhao, Kai Xiong, Junjie Nian, Yaoning Wang, Changyi Xiao, and Yixin Cao. NEX: Neuron explore-exploit scoring for label-free chain-of-thought selection and model ranking. arXiv preprint arXiv:2602.05805, 2026. URLhttps://arxiv.org/abs/2602.05805

arXiv 2026
[4]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R.x. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.k. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. InProceedings of the 62nd Annual Meeting of the Association...

work page doi:10.18653/v1/2024.acl-long.70 2024
[5]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022. URLhttps://jmlr.org/ papers/v23/21-0998.html

2022
[6]

Deep think with confidence

Yichao Fu, Xuewei Wang, Hao Zhang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=8LqHs0KIM7

2026
[7]

Layer-wise MoE routing locality under shared-prefix code generation: Token-identity decomposition and compile-equivalent fork redundancy

Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, and Takahiro Katagiri. Layer-wise MoE routing locality under shared-prefix code generation: Token-identity decomposition and compile-equivalent fork redundancy. arXiv preprint arXiv:2604.17182, 2026. URLhttps://arxiv.org/abs/2604.17182

Pith/arXiv arXiv 2026
[8]

Slim-SC: Thought pruning for efficient scaling with self-consistency

Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, and Dmitrii Ustiugov. Slim-SC: Thought pruning for efficient scaling with self-consistency. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34500–34517, Suzhou, China, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.emnlp-main.1750...

work page doi:10.18653/v1/2025.emnlp-main.1750 2025
[9]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

Pith/arXiv arXiv 2024
[10]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.06770

Pith/arXiv arXiv 2024
[11]

The path of least resistance: Guiding LLM reasoning trajectories with prefix consensus

Ishan Jindal, Sai Prashanth Akuthota, Jayant Taneja, and Sachin Dev Sharma. The path of least resistance: Guiding LLM reasoning trajectories with prefix consensus. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hrnSqERgPn

2026
[12]

OpenAI Harmony Response Format, August 5 2025

Dominik Kundel. OpenAI Harmony Response Format, August 5 2025. URLhttps://developers.openai.com/ cookbook/articles/openai-harmony. OpenAI Cookbook. Accessed 2026-05-07

2025
[13]

GShard: Scaling giant models with conditional computation and automatic sharding

DmitryLepikhin,HyoukJoongLee,YuanzhongXu,DehaoChen,OrhanFirat,YanpingHuang,MaximKrikun,Noam 12 Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. URLhttps://arxiv.org/abs/2006.16668

Pith/arXiv arXiv 2021
[14]

Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning

Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=ndR8Ytrzhh

2024
[15]

Inharmonywithgpt-oss

BorislavMavrin. Inharmonywithgpt-oss. arXivpreprintarXiv:2604.00362,2026. URL https://arxiv.org/abs/ 2604.00362

arXiv 2026
[16]

Introducing SWE-bench verified

OpenAI. Introducing SWE-bench verified. OpenAI blog, 2024. URL https://openai.com/index/ introducing-swe-bench-verified/

2024
[17]

gpt-oss-120b & gpt-oss-20b Model Card, August 5 2025

OpenAI. gpt-oss-120b & gpt-oss-20b Model Card, August 5 2025. URL https://cdn.openai.com/pdf/ 419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf. Accessed 2026-05-07

2025
[18]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017. URLhttps://openreview.net/forum?id=B1ckMDqlg

2017
[19]

Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling

Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3613–3635, Albuquerque, New Mexico, 2...

work page doi:10.18653/v1/2025.naacl-long.184 2025
[20]

The myth of expert specialization in MoEs: Why routing reflects geometry, not necessarily domain expertise

Xi Wang, Soufiane Hayou, and Eric Nalisnick. The myth of expert specialization in MoEs: Why routing reflects geometry, not necessarily domain expertise. arXiv preprint arXiv:2604.09780, 2026. URLhttps://arxiv.org/ abs/2604.09780

Pith/arXiv arXiv 2026
[21]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InInternational Conference on Learning Representations (ICLR), 2023. URLhttps://arxiv.org/abs/2203.11171

Pith/arXiv arXiv 2023
[22]

Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35, 2022. URLhttps://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2022
[23]

OpenMoE: An early effort on open mixture-of-experts language models

Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. OpenMoE: An early effort on open mixture-of-experts language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Lear...
[24]

URLhttps://proceedings.mlr.press/v235/xue24c.html
[25]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. URLhttps://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024
[26]

Beyond benchmarks: Understanding mixture-of-experts models through internal mechanisms

Jiahao Ying, Mingbao Lin, Qianru Sun, and Yixin Cao. Beyond benchmarks: Understanding mixture-of-experts models through internal mechanisms. arXiv preprint arXiv:2509.23933, 2025. URLhttps://arxiv.org/abs/ 2509.23933

arXiv 2025
[27]

anchor”):so in blue,.\n\n in red. The horizontal orange line (printed in-panel with the earlier label “post-\boxed

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419, 2025. URL https://arxiv.org/abs/2504.05419. A Technical appendices and supplementary material This appendix contains the full answer-string-free protocol ...

arXiv 2025

[1] [1]

Modelutilitylaw: Evaluating LLMs beyond performance through mechanism interpretable metric

YixinCao,JiahaoYing,YaoningWang,XipengQiu,XuanjingHuang,andYugangJiang. Modelutilitylaw: Evaluating LLMs beyond performance through mechanism interpretable metric. arXiv preprint arXiv:2504.07440, 2025. URL https://arxiv.org/abs/2504.07440

arXiv 2025

[2] [2]

Do LLMs signal when they’re right? evidence from neuron agreement, 2025

Kang Chen, Yaoning Wang, Kai Xiong, Zhuoka Feng, Wenhe Sun, Haotian Chen, and Yixin Cao. Do LLMs signal when they’re right? evidence from neuron agreement, 2025. URLhttps://arxiv.org/abs/2510.26277

arXiv 2025

[3] [3]

NEX: Neuron explore-exploit scoring for label-free chain-of-thought selection and model ranking

Kang Chen, Zhuoka Feng, Sihan Zhao, Kai Xiong, Junjie Nian, Yaoning Wang, Changyi Xiao, and Yixin Cao. NEX: Neuron explore-exploit scoring for label-free chain-of-thought selection and model ranking. arXiv preprint arXiv:2602.05805, 2026. URLhttps://arxiv.org/abs/2602.05805

arXiv 2026

[4] [4]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R.x. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.k. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. InProceedings of the 62nd Annual Meeting of the Association...

work page doi:10.18653/v1/2024.acl-long.70 2024

[5] [5]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022. URLhttps://jmlr.org/ papers/v23/21-0998.html

2022

[6] [6]

Deep think with confidence

Yichao Fu, Xuewei Wang, Hao Zhang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=8LqHs0KIM7

2026

[7] [7]

Layer-wise MoE routing locality under shared-prefix code generation: Token-identity decomposition and compile-equivalent fork redundancy

Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, and Takahiro Katagiri. Layer-wise MoE routing locality under shared-prefix code generation: Token-identity decomposition and compile-equivalent fork redundancy. arXiv preprint arXiv:2604.17182, 2026. URLhttps://arxiv.org/abs/2604.17182

Pith/arXiv arXiv 2026

[8] [8]

Slim-SC: Thought pruning for efficient scaling with self-consistency

Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, and Dmitrii Ustiugov. Slim-SC: Thought pruning for efficient scaling with self-consistency. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34500–34517, Suzhou, China, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.emnlp-main.1750...

work page doi:10.18653/v1/2025.emnlp-main.1750 2025

[9] [9]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

Pith/arXiv arXiv 2024

[10] [10]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.06770

Pith/arXiv arXiv 2024

[11] [11]

The path of least resistance: Guiding LLM reasoning trajectories with prefix consensus

Ishan Jindal, Sai Prashanth Akuthota, Jayant Taneja, and Sachin Dev Sharma. The path of least resistance: Guiding LLM reasoning trajectories with prefix consensus. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hrnSqERgPn

2026

[12] [12]

OpenAI Harmony Response Format, August 5 2025

Dominik Kundel. OpenAI Harmony Response Format, August 5 2025. URLhttps://developers.openai.com/ cookbook/articles/openai-harmony. OpenAI Cookbook. Accessed 2026-05-07

2025

[13] [13]

GShard: Scaling giant models with conditional computation and automatic sharding

DmitryLepikhin,HyoukJoongLee,YuanzhongXu,DehaoChen,OrhanFirat,YanpingHuang,MaximKrikun,Noam 12 Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. URLhttps://arxiv.org/abs/2006.16668

Pith/arXiv arXiv 2021

[14] [14]

Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning

Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=ndR8Ytrzhh

2024

[15] [15]

Inharmonywithgpt-oss

BorislavMavrin. Inharmonywithgpt-oss. arXivpreprintarXiv:2604.00362,2026. URL https://arxiv.org/abs/ 2604.00362

arXiv 2026

[16] [16]

Introducing SWE-bench verified

OpenAI. Introducing SWE-bench verified. OpenAI blog, 2024. URL https://openai.com/index/ introducing-swe-bench-verified/

2024

[17] [17]

gpt-oss-120b & gpt-oss-20b Model Card, August 5 2025

OpenAI. gpt-oss-120b & gpt-oss-20b Model Card, August 5 2025. URL https://cdn.openai.com/pdf/ 419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf. Accessed 2026-05-07

2025

[18] [18]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017. URLhttps://openreview.net/forum?id=B1ckMDqlg

2017

[19] [19]

Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling

Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3613–3635, Albuquerque, New Mexico, 2...

work page doi:10.18653/v1/2025.naacl-long.184 2025

[20] [20]

The myth of expert specialization in MoEs: Why routing reflects geometry, not necessarily domain expertise

Xi Wang, Soufiane Hayou, and Eric Nalisnick. The myth of expert specialization in MoEs: Why routing reflects geometry, not necessarily domain expertise. arXiv preprint arXiv:2604.09780, 2026. URLhttps://arxiv.org/ abs/2604.09780

Pith/arXiv arXiv 2026

[21] [21]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InInternational Conference on Learning Representations (ICLR), 2023. URLhttps://arxiv.org/abs/2203.11171

Pith/arXiv arXiv 2023

[22] [22]

Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35, 2022. URLhttps://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2022

[23] [23]

OpenMoE: An early effort on open mixture-of-experts language models

Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. OpenMoE: An early effort on open mixture-of-experts language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Lear...

[24] [24]

URLhttps://proceedings.mlr.press/v235/xue24c.html

[25] [25]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. URLhttps://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024

[26] [26]

Beyond benchmarks: Understanding mixture-of-experts models through internal mechanisms

Jiahao Ying, Mingbao Lin, Qianru Sun, and Yixin Cao. Beyond benchmarks: Understanding mixture-of-experts models through internal mechanisms. arXiv preprint arXiv:2509.23933, 2025. URLhttps://arxiv.org/abs/ 2509.23933

arXiv 2025

[27] [27]

anchor”):so in blue,.\n\n in red. The horizontal orange line (printed in-panel with the earlier label “post-\boxed

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419, 2025. URL https://arxiv.org/abs/2504.05419. A Technical appendices and supplementary material This appendix contains the full answer-string-free protocol ...

arXiv 2025