AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3
The pith
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdverMCTS formulates code generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves targeted corner-case tests that exploit logical divergences in the current code pool. The discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning and reduces pseudo-correctness.
What carries the argument
Adversarial Monte Carlo Tree Search that couples a Solver agent producing code with an Attacker agent generating exploitative tests in a minimax loop.
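The paper body is not reproduced on this page, so the loop can only be sketched from the description above. Below is a minimal, self-contained Python sketch of the solver–attacker dynamic; the hand-written candidate pool and the exhaustive attacker are toy stand-ins for the MCTS components the paper actually uses, not the authors' implementation:

```python
# Toy sketch of the adversarial loop: a "solver" pool of candidate
# implementations of abs(), and an "attacker" that searches for inputs on
# which a surviving candidate diverges from the (hidden) specification.

def spec(x):
    return abs(x)                         # hidden ground truth

public_tests = [(3, 3), (5, 5)]           # sparse, positive-only public tests

candidates = [
    lambda x: x,                          # pseudo-correct: passes public tests
    lambda x: x if x >= 0 else -x,        # genuinely correct
]

def passes(cand, tests):
    return all(cand(x) == y for x, y in tests)

def attacker(pool, inputs=range(-5, 6)):
    """Return a new test on which some surviving candidate diverges."""
    for x in inputs:
        if any(c(x) != spec(x) for c in pool):
            return (x, spec(x))
    return None

tests = list(public_tests)
pool = [c for c in candidates if passes(c, tests)]
while (new := attacker(pool)) is not None:
    tests.append(new)                     # progressively hostile filter
    pool = [c for c in pool if passes(c, tests)]

# Only the genuinely correct candidate survives the adversarial filter.
```

In this toy run, the pseudo-correct identity function survives the public tests but is eliminated once the attacker adds a negative input to the test pool, which is the dynamic the abstract describes.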
If this is right
- Solutions exhibit lower false-positive rates when verified against hidden test suites.
- Generated code must handle logical scenarios absent from the initial public constraints.
- The search process yields candidates that generalize beyond the visible test distribution.
- Performance gains hold across multiple code-generation benchmarks compared with non-adversarial search methods.
Where Pith is reading between the lines
- The same solver-attacker loop could be applied to other generation tasks such as test-case creation or proof synthesis where edge-case discovery matters.
- Repeated adversarial filtering might surface systematic reasoning weaknesses that could then be used to improve base model training.
- Efficiency hinges on the attacker converging quickly; slower convergence would limit practical deployment on large code pools.
Load-bearing premise
The attacker agent can consistently discover non-trivial logical divergences in the code pool without excessive compute or converging to ineffective tests.
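One concrete reading of "logical divergence" (this interpretation is ours, not spelled out on this page) is differential testing: an input on which two pool candidates disagree is a discriminating test even before the correct output is known. A minimal sketch:

```python
# Find an input on which two candidates in the code pool disagree.
# The candidate pool and probe inputs below are hypothetical examples.
from itertools import combinations

def find_divergence(pool, inputs):
    """Return (input, i, j) where candidates i and j disagree, else None."""
    for x in inputs:
        for (i, f), (j, g) in combinations(enumerate(pool), 2):
            if f(x) != g(x):
                return x, i, j
    return None

pool = [
    lambda s: s == s[::-1],                  # case-sensitive palindrome check
    lambda s: s.lower() == s.lower()[::-1],  # case-insensitive variant
]
result = find_divergence(pool, ["aba", "Aba", "ab"])
# "Aba" separates the two candidates: only the second accepts it.
```

The premise is then that an attacker can find such separating inputs efficiently over realistic input spaces, where exhaustive probing like the above is infeasible.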
What would settle it
An evaluation on standard code benchmarks in which the attacker produces no additional failing tests beyond the public set and overall pass rates on hidden tests remain unchanged from static baselines.
Original abstract
Recent advancements in Large Language Models (LLMs) have successfully employed search-based strategies to enhance code generation. However, existing methods typically rely on static, sparse public test cases for verification, leading to pseudo-correctness -- where solutions overfit the visible public tests but fail to generalize to hidden test cases. We argue that optimizing against a fixed, weak environment inherently limits robustness. To address this, we propose AdverMCTS, a novel adversarial Monte Carlo Tree Search framework that combats pseudo-correctness by coupling code search with active vulnerability discovery. AdverMCTS formulates generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves to generate targeted corner test cases that exploit logical divergences in the current code pool. These discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning. Extensive experiments demonstrate that AdverMCTS significantly outperforms state-of-the-art baselines, effectively reducing false positive rates and forcing the model to generalize beyond the initial constraints. The resources of this work are available at https://anonymous.4open.science/r/AdverMCTS_open-A255.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AdverMCTS, a novel adversarial Monte Carlo Tree Search framework for LLM-based code generation. It formulates the task as a minimax-style game between a Solver agent that synthesizes code candidates and an Attacker agent that evolves targeted corner-case tests to exploit logical divergences in the current code pool. These dynamically discovered tests serve as a progressively hostile filter to penalize fragile solutions that overfit to static public tests (pseudo-correctness) while failing to generalize to hidden cases. The paper claims that extensive experiments demonstrate significant outperformance over state-of-the-art baselines, with reduced false-positive rates.
Significance. If the central results hold, the work could advance robustness in code generation by moving beyond static verification to active adversarial testing, addressing a recognized limitation in current search-based LLM methods. The open release of resources supports reproducibility and follow-on work.
Major comments (2)
- [§3] §3 (Method), Attacker agent description: No details are provided on the attacker's state representation, mutation operators, reward signal, or stopping criteria. This is load-bearing for the headline claim, because reduced false positives and improved generalization require the attacker to reliably surface non-trivial logical divergences rather than noise or duplicates; without these components it is impossible to determine whether the adversarial loop adds value beyond extra search budget.
- [§4] §4 (Experiments): The abstract asserts outperformance and reduced false positives, yet the provided text contains no quantitative results, baseline implementations, statistical tests, ablation studies, or convergence analysis for the attacker. The central claim that AdverMCTS forces generalization therefore rests on unreviewed experimental evidence.
Minor comments (1)
- The anonymous resource link should be replaced with a permanent repository identifier before publication.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments highlight important areas where the presentation of AdverMCTS can be strengthened. We address each major comment below and describe the revisions we will incorporate.
Point-by-point responses
-
Referee: [§3] §3 (Method), Attacker agent description: No details are provided on the attacker's state representation, mutation operators, reward signal, or stopping criteria. This is load-bearing for the headline claim, because reduced false positives and improved generalization require the attacker to reliably surface non-trivial logical divergences rather than noise or duplicates; without these components it is impossible to determine whether the adversarial loop adds value beyond extra search budget.
Authors: We agree that §3 currently provides only a high-level description of the Attacker and that the requested implementation details are necessary to substantiate the minimax formulation. In the revised manuscript we will expand §3 with: (1) state representation as the tuple (current code pool, accumulated test cases, divergence history); (2) mutation operators consisting of boundary-value injection, logical negation of conditions, and input-distribution perturbation; (3) reward signal defined as the count of code candidates that pass public tests yet fail the newly generated test; and (4) stopping criteria based on either a fixed iteration budget or plateau in new logical divergences. We will also add pseudocode for the attacker loop and a brief complexity analysis. These additions will clarify that the adversarial component contributes targeted test cases beyond uniform extra search budget. revision: yes
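The reward signal promised in item (3), the number of public-test survivors that a newly generated test eliminates, is simple enough to sketch. The candidate functions below are hypothetical stand-ins, not the paper's pool:

```python
# Sketch of the reward signal described in the rebuttal: a candidate test is
# rewarded once for every candidate that passes the public tests yet fails it.

def attacker_reward(candidates, public_tests, new_test):
    x, expected = new_test
    survivors = [c for c in candidates
                 if all(c(i) == o for i, o in public_tests)]
    return sum(1 for c in survivors if c(x) != expected)

candidates = [
    lambda x: x * x,    # genuinely correct for the target f(x) = x^2
    lambda x: 2 * x,    # pseudo-correct: agrees with x^2 at x = 2
    lambda x: x + 2,    # pseudo-correct: agrees with x^2 at x = 2
]
reward = attacker_reward(candidates, public_tests=[(2, 4)], new_test=(3, 9))
# The test (3, 9) eliminates both pseudo-correct survivors: reward == 2.
```

A redundant test that all survivors already pass earns zero reward, so this signal steers the attacker toward tests that actually shrink the pseudo-correct pool.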
-
Referee: [§4] §4 (Experiments): The abstract asserts outperformance and reduced false positives, yet the provided text contains no quantitative results, baseline implementations, statistical tests, ablation studies, or convergence analysis for the attacker. The central claim that AdverMCTS forces generalization therefore rests on unreviewed experimental evidence.
Authors: We acknowledge that the experimental section in the submitted version was insufficiently detailed and that the quantitative evidence must be fully visible to support the claims. The full manuscript contains §4 with results on HumanEval, MBPP, and APPS, but to address the concern we will revise §4 to include: explicit tables reporting pass@1, false-positive rates, and generalization gaps versus baselines (CodeT, AlphaCode, etc.); re-implementation details for all baselines; statistical significance tests with p-values; ablation studies isolating the attacker’s contribution; and convergence curves for both solver and attacker MCTS. The linked repository already contains the complete experimental code and data; we will add a pointer to the specific result files in the revised text. revision: yes
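The false-positive rate the referee asks to see tabulated has a standard operationalization: among solutions accepted by the public tests, the fraction that fail the hidden suite. A sketch, with illustrative solutions rather than the paper's data:

```python
# False-positive rate: of the solutions that pass the public tests, the
# fraction that nonetheless fail the hidden tests. Illustrative example only.

def passes(f, tests):
    return all(f(x) == y for x, y in tests)

def false_positive_rate(solutions, public, hidden):
    accepted = [f for f in solutions if passes(f, public)]
    if not accepted:
        return 0.0
    return sum(1 for f in accepted if not passes(f, hidden)) / len(accepted)

solutions = [abs, lambda x: x]            # identity overfits positive inputs
fpr = false_positive_rate(solutions,
                          public=[(1, 1), (2, 2)],
                          hidden=[(-1, 1), (0, 0)])
# identity passes public but fails hidden; abs passes both: fpr == 0.5
```

Reporting this metric alongside pass@1 would directly quantify the pseudo-correctness the paper claims to reduce.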
Circularity Check
No significant circularity detected
Full rationale
The paper introduces AdverMCTS as a novel adversarial Monte Carlo Tree Search framework formulated as a minimax game between Solver and Attacker agents, with performance gains demonstrated through empirical experiments on code generation tasks. No equations, derivations, or first-principles results are presented that reduce the claimed outperformance or reduced false-positive rates to fitted parameters, self-definitions, or self-citation chains. The method is self-contained as an empirical construction evaluated against external baselines, with no load-bearing steps that collapse to the inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: An attacker that generates targeted corner cases will expose logical divergences that static tests miss.
Reference graph
Works this paper leans on
- [2] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [3] Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024.
- [4] Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
- [5] Chaffin, A., Claveau, V., and Kijak, E. PPL-MCTS: Constrained textual generation through discriminator-guided MCTS decoding. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2953–2967, 2022.
- [6] Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. CodeT: Code generation with generated tests. arXiv preprint arXiv:2207.10397, 2022.
- [7] Chen, J., Hu, X., Li, Z., Gao, C., Xia, X., and Lo, D. Code search is all you need? Improving code suggestions with code search. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024.
- [8] Chen, M., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [9] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [10] Chen, Z., White, M., Mooney, R., Payani, A., Su, Y., and Sun, H. When is tree search useful for LLM planning? It depends on the discriminator. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13659–13678, 2024.
- [11] Dong, Y., Ding, J., Jiang, X., Li, G., Li, Z., and Jin, Z. CodeScore: Evaluating code generation by learning code execution. ACM Transactions on Software Engineering and Methodology, 34(3):1–22, 2025.
- [12] Dong, Y., Jiang, X., Qian, J., Wang, T., Zhang, K., Jin, Z., and Li, G. A survey on code generation with LLM-based agents. arXiv preprint arXiv:2508.00083, 2025.
- [13] Gao, S., Gao, C., Gu, W., and Lyu, M. Search-based LLMs for code optimization. arXiv preprint arXiv:2408.12159, 2024.
- [14] Gou, Q., Dong, Y., Wu, Y., and Ke, Q. RRGCode: Deep hierarchical search-based code generation. Journal of Systems and Software, 211:111982, 2024.
- [15] Han, Y., Zhang, L., Meng, D., Zhang, Z., Hu, X., and Weng, S. A value-based parallel update MCTS method for multi-agent cooperative decision making of connected and automated vehicles. arXiv preprint arXiv:2409.13783, 2024.
- [16] Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., and Hu, Z. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8154–8173, 2023.
- [17] Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938, 2021.
- [18] Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
- [19] Jiang, J., Wang, F., Shen, J., Kim, S., and Kim, S. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024.
- [20] Jiang, X., Dong, Y., Wang, L., Fang, Z., Shang, Q., Li, G., Jin, Z., and Jiao, W. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024.
- [21] Kurant, M., Markopoulou, A., and Thiran, P. On the bias of BFS (breadth first search). In 2010 22nd International Teletraffic Congress (ITC 22), pp. 1–8. IEEE, 2010.
- [22] Kwon, W. vLLM: An Efficient Inference Engine for Large Language Models. PhD thesis, University of California, Berkeley, 2025.
- [23] Li, D. and Murr, L. HumanEval on latest GPT models, 2024. arXiv preprint arXiv:2402.14852, 2024.
- [24] Li, D., Cao, S., Cao, C., Li, X., Tan, S., Keutzer, K., Xing, J., Gonzalez, J. E., and Stoica, I. S*: Test-time scaling for code generation. arXiv preprint arXiv:2502.14382, 2025.
- [25] Li, J., Le, H., Zhou, Y., Xiong, C., Savarese, S., and Sahoo, D. CodeTree: Agent-guided tree search for code generation with large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3711–3726, 2025.
- [26] Li, Q., Dai, X., Li, X., Zhang, W., Wang, Y., Tang, R., and Yu, Y. CodePRM: Execution feedback-enhanced process reward model for code generation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 8169–8182, 2025.
- [27] Li, Q., Dai, X., Liu, W., Li, X., Wang, Y., Tang, R., Yu, Y., and Zhang, W. ATGen: Adversarial reinforcement learning for test case generation. arXiv preprint arXiv:2510.14635, 2025.
- [28] Li, Q., Xia, W., Dai, X., Du, K., Liu, W., Wang, Y., Tang, R., Yu, Y., and Zhang, W. RethinkMCTS: Refining erroneous thoughts in Monte Carlo tree search for code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 8103–8121, 2025.
- [29] Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- [30] Li, R., Fu, J., Zhang, B.-W., Huang, T., Sun, Z., Lyu, C., Liu, G., Jin, Z., and Li, G. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023.
- [31] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- [32] Li, Z.-Z., Zhang, D., Zhang, M.-L., Zhang, J., Liu, Z., Yao, Y., Xu, H., Zheng, J., Wang, P.-J., Chen, X., et al. From System 1 to System 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025.
- [33] Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [34] Lotov, A. V. and Miettinen, K. Visualizing the Pareto frontier. In Multiobjective Optimization: Interactive and Evolutionary Approaches, pp. 213–243. Springer, 2008.
- [35] Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
- [36] Lyu, Z., Huang, J., Deng, Y., Hoi, S., and An, B. Let's revise step-by-step: A unified local search framework for code generation with LLMs. arXiv preprint arXiv:2508.07434, 2025.
- [37] Mirzayanov, M., Pavlova, O., Mavrin, P., Melnikov, R., Plotnikov, A., Parfenov, V., and Stankevich, A. Codeforces as an educational platform for learning programming in digitalization. Olympiads in Informatics, 14:133–142, 2020.
- [38] Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. B. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332, 2025.
- [39] Ni, A., Iyer, S., Radev, D., Stoyanov, V., Yih, W.-t., Wang, S., and Lin, X. V. LEVER: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning, pp. 26106–26128. PMLR, 2023.
- [40] Odeh, A., Odeh, N., and Mohammed, A. S. A comparative review of AI techniques for automated code generation in software development: advancements, challenges, and future directions. TEM Journal, 13(1):726, 2024.
- [41] Paul, D. G., Zhu, H., and Bayley, I. Benchmarks and metrics for evaluations of code generation: A critical review. In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 87–94. IEEE, 2024.
- [42] Princis, H., Sharma, A., and David, C. TreeCoder: Systematic exploration and optimisation of decoding and constraints for LLM code generation. arXiv preprint arXiv:2511.22277, 2025.
- [43] Shao, T., Zhang, K., Cheng, K., and Zhang, H. A decision-making framework using MCTS as a hierarchical task network and deep learning connector. Science Progress, 108(4), 2025.
- [44] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
- [45] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [46] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
- [47] Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- [48] Wang, C., Zhang, J., Feng, Y., Li, T., Sun, W., Liu, Y., and Peng, X. Teaching code LLMs to use autocompletion tools in repository-level code generation. ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025.
- [49] Wang, E., Cassano, F., Wu, C., Bai, Y., Song, W., Nath, V., Han, Z., Hendryx, S., Yue, S., and Zhang, H. Planning in natural language improves LLM search for code generation. arXiv preprint arXiv:2409.03733, 2024.
- [50] Wang, X. Reward-centered ReST-MCTS: A robust decision-making framework for robotic manipulation in high uncertainty environments. arXiv preprint arXiv:2503.05226, 2025.
- [51] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [52] Wu, T., Chen, J., Lin, W., Zhan, J., Li, M., Kuang, K., and Wu, F. Personalized distractor generation via MCTS-guided reasoning reconstruction. arXiv preprint arXiv:2508.11184, 2025.
- [53] Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D., et al. Towards System 2 reasoning in LLMs: Learning how to think with meta chain-of-thought. arXiv preprint arXiv:2501.04682, 2025.
- [54] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [55] Yang, Z., Chen, S., Gao, C., Li, Z., Hu, X., Liu, K., and Xia, X. An empirical study of retrieval-augmented code generation: Challenges and opportunities. ACM Transactions on Software Engineering and Methodology, 2025.
- [56] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023.
- [57] Zeng, Z., Cheng, Q., Yin, Z., Zhou, Y., and Qiu, X. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? arXiv preprint arXiv:2502.12215, 2025.
- [58] Zhang, Q., Lyu, F., Sun, Z., Wang, L., Zhang, W., Hua, W., Wu, H., Guo, Z., Wang, Y., Muennighoff, N., et al. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235, 2025.
- [59] Zhang, S., Chen, Z., Shen, Y., Ding, M., Tenenbaum, J. B., and Gan, C. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510, 2023.
- [60] Zhong, L., Wang, Z., and Shang, J. LDB: A large language model debugger via verifying runtime execution step-by-step. arXiv preprint arXiv:2402.16906, 2024.
- [61] Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023.