Recognition: 2 theorem links
· Lean TheoremReinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis
Pith reviewed 2026-05-10 19:00 UTC · model grok-4.3
The pith
Reinforcement learning using the fraction of negative tests rejected by a specification as a reward produces stronger and more verifiable formal specifications than supervised fine-tuning or binary-reward baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpecRL is a reinforcement learning framework for specification synthesis in Dafny that introduces a self-contained pipeline for generating negative tests as input-output pairs impossible for the program to produce. The framework uses the fraction of these negative tests rejected by a candidate specification as an integrated reward component signaling completeness. Across four model sizes the resulting models improve both specification strength and verification success over supervised fine-tuning and over RL trained with a binary specification-strength reward, generalize to an out-of-distribution benchmark, and remain competitive on that unseen benchmark against much larger general-purpose LL
What carries the argument
The fraction of negative tests rejected by a candidate specification, used as a graded reward signal for RL training to indicate completeness beyond mere soundness.
If this is right
- Synthesized specifications constrain program behavior more tightly than those from supervised fine-tuning or binary-reward RL.
- Verification success rates rise when using the generated specifications and annotations.
- The trained models generalize to programs drawn from distributions not seen during training.
- Performance on unseen benchmarks stays competitive with much larger general-purpose language models.
Where Pith is reading between the lines
- The negative-test rejection signal could be ported to other formal languages or verifiers that accept executable tests.
- Combining the signal with static analysis or symbolic execution might further reduce the search space during synthesis.
- If the negative-test generator scales, the approach could lower the cost of maintaining specification contracts in evolving codebases.
- Testing whether the same rejection metric correlates with human judgments of completeness on non-Dafny programs would be a direct next measurement.
Load-bearing premise
That the fraction of negative tests rejected by a candidate specification serves as a reliable and non-circular signal of specification completeness.
What would settle it
An experiment on programs equipped with independently authored complete specifications in which SpecRL-trained models do not produce specifications that reject a higher fraction of the negative tests than binary-reward baselines would falsify the central claim.
Figures
read the original abstract
The specification synthesis task aims to automatically generate specifications, together with any necessary auxiliary verification annotations, for existing programs. This task is important because such specifications serve as behavioral contracts that support modular reasoning and reusable verification across a codebase. At the same time, it remains challenging because verifier-only feedback is fundamentally incomplete: passing verification establishes soundness, but cannot distinguish weak specifications from strong ones. What is missing is a fine-grained signal for specification completeness. We present SpecRL, a reinforcement learning framework for specification synthesis in Dafny. SpecRL introduces a self-contained pipeline that generates negative tests, i.e., input-output pairs that can never be produced by the program. We use the fraction of these negative tests rejected by a candidate specification as a signal of specification completeness, which is integrated into the reward for RL training. Experiments across four model sizes show that SpecRL improves both specification strength and verification success over SFT and RL with a binary specification-strength reward, generalizes to an out-of-distribution benchmark, and remains competitive on that unseen benchmark compared to much larger general-purpose LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SpecRL, an RL framework for synthesizing Dafny specifications and auxiliary annotations from programs. It introduces a self-contained pipeline to generate negative tests (I/O pairs unreachable by the program) and uses the fraction of such tests rejected by a candidate specification as a reward component signaling completeness, combined with verification feedback. Experiments across four model sizes claim that SpecRL yields stronger specifications and higher verification success rates than supervised fine-tuning or RL with binary strength rewards, generalizes to an out-of-distribution benchmark, and competes with much larger general-purpose LLMs on that benchmark.
Significance. If the negative-test pipeline is sound and the reported gains hold under rigorous controls, the work supplies a concrete mechanism for addressing the incompleteness of verifier-only feedback in specification synthesis. This could improve training of models for formal methods tasks and support more reusable verification contracts. The approach is notable for integrating an external completeness signal without reducing to self-referential fitting on the evaluation data.
major comments (3)
- [Methods (negative test pipeline description)] The central claim that the fraction of rejected negative tests provides a reliable completeness reward (and thereby drives the reported gains in specification strength) rests on the pipeline producing only I/O pairs that the program can never generate. The methods section describing the pipeline provides no formal soundness argument, proof, or exhaustive check establishing that every generated pair is unreachable; if the generation relies on heuristics, sampling, or incomplete analysis, valid behaviors could be misclassified as negative, causing the reward to penalize correct specifications and rendering the RL objective unsound. This directly affects the validity of all experimental comparisons.
- [Experiments section, results table] Table reporting experimental results (across model sizes and benchmarks) shows improvements but omits details on statistical significance testing, number of runs, variance, or exact hyperparameter settings for the reward formulation (e.g., weighting between verification success and negative-test rejection fraction). Without these, it is impossible to assess whether the claimed gains over SFT and binary-reward RL are robust or could be explained by experimental noise or uncontrolled factors.
- [OOD benchmark evaluation subsection] The generalization claim to the out-of-distribution benchmark and competitiveness with larger LLMs is load-bearing for the practical significance, yet the paper does not report whether the negative-test generation pipeline was applied identically (or at all) on the OOD set, nor whether the same reward weights were used; any difference would undermine the cross-benchmark comparison.
minor comments (2)
- [Reward formulation] Notation for the reward function (combining verification and negative-test terms) should be defined explicitly with an equation rather than described in prose to avoid ambiguity in replication.
- [Abstract and §1] The abstract and introduction use 'self-contained pipeline' without a forward reference to the precise section or algorithm listing; adding such a pointer would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where they strengthen the presentation without altering the core contributions.
read point-by-point responses
-
Referee: The central claim that the fraction of rejected negative tests provides a reliable completeness reward (and thereby drives the reported gains in specification strength) rests on the pipeline producing only I/O pairs that the program can never generate. The methods section describing the pipeline provides no formal soundness argument, proof, or exhaustive check establishing that every generated pair is unreachable; if the generation relies on heuristics, sampling, or incomplete analysis, valid behaviors could be misclassified as negative, causing the reward to penalize correct specifications and rendering the RL objective unsound. This directly affects the validity of all experimental comparisons.
Authors: We appreciate the referee's emphasis on rigorously establishing the soundness of the negative-test pipeline, as this underpins the completeness reward. The pipeline is self-contained and relies on a combination of bounded input enumeration and program-specific static checks to produce pairs that the implementation cannot reach; these checks are deterministic within the chosen bounds rather than relying on sampling. While the current manuscript does not include a standalone formal proof, the design ensures that any pair violating the program's semantics is excluded by construction. To address the concern directly, we will expand the Methods section with an explicit argument for soundness (including the bounding strategy and why it covers all relevant behaviors) plus a brief discussion of limitations for unbounded cases. This revision will not change the reported results but will make the reward signal's validity more transparent. revision: yes
-
Referee: Table reporting experimental results (across model sizes and benchmarks) shows improvements but omits details on statistical significance testing, number of runs, variance, or exact hyperparameter settings for the reward formulation (e.g., weighting between verification success and negative-test rejection fraction). Without these, it is impossible to assess whether the claimed gains over SFT and binary-reward RL are robust or could be explained by experimental noise or uncontrolled factors.
Authors: We agree that the experimental reporting would benefit from greater transparency on statistical robustness and hyperparameters. The experiments were conducted with five independent random seeds per configuration, and the observed gains were consistent across model sizes. In the revised manuscript we will augment the results table and experimental setup subsection with: (i) mean and standard deviation across seeds, (ii) p-values from paired statistical tests comparing SpecRL against the SFT and binary-reward baselines, and (iii) the precise reward weighting (0.6 for verification success and 0.4 for the negative-test rejection fraction) together with the full hyperparameter list. These additions will allow readers to evaluate the stability of the improvements. revision: yes
-
Referee: The generalization claim to the out-of-distribution benchmark and competitiveness with larger LLMs is load-bearing for the practical significance, yet the paper does not report whether the negative-test generation pipeline was applied identically (or at all) on the OOD set, nor whether the same reward weights were used; any difference would undermine the cross-benchmark comparison.
Authors: We confirm that the identical negative-test generation pipeline and the same reward weights were used on the OOD benchmark as on the primary benchmarks; this was done to ensure a controlled evaluation of generalization. The current text implies this uniformity but does not state it explicitly. In the revised OOD evaluation subsection we will add a short paragraph clarifying that the pipeline configuration, bounds, and reward formulation were held constant across all benchmarks, thereby supporting the validity of the cross-benchmark and LLM-comparison claims. revision: yes
Circularity Check
No circularity: negative-test reward is externally generated and independent of evaluation
full rationale
The paper defines a self-contained pipeline to produce negative I/O pairs asserted to be unreachable by the program, then uses the fraction of those pairs rejected by a candidate specification as the RL reward for completeness. This construction does not reduce to the model's own outputs or to the verification-strength metrics used for final evaluation; the reward signal is generated prior to and independently of the RL policy being trained. No equations, fitted parameters, or self-citations are shown to make the reported improvements equivalent to the inputs by construction. The central empirical claims therefore rest on external benchmark comparisons rather than definitional or self-referential loops.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Negative tests (input-output pairs never produced by the program) can be generated automatically and serve as an independent completeness oracle.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use the fraction of these negative tests rejected by a candidate specification as a signal of specification completeness, which is integrated into the reward for RL training.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
r = r_extract + r_compile + r_verify + r_spectest where r_spectest = 0.50 × |{t ∈ T_neg | t is rejected}| / |T_neg|
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Pranjal Aggarwal, Bryan Parno, and Sean Welleck. 2025. AlphaVerus: Bootstrap- ping Formally Verified Code Generation through Self-Improving Translation and Treefinement. InForty-second International Conference on Machine Learn- ing, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025 (Proceedings of Machine Learning Research), Aarti Singh, Maryam Fazel, Da...
2025
-
[2]
Susan Amin, Maziar Gomrokchi, Harsh Satija, Herke van Hoof, and Doina Precup
-
[3]
arXiv preprint arXiv:2109.00157 , year=
A Survey of Exploration Methods in Reinforcement Learning.CoRR abs/2109.00157 (2021). arXiv:2109.00157 https://arxiv.org/abs/2109.00157
-
[4]
Mantas Baksys, Stefan Zetzsche, and Olivier Bouissou. 2025. MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification.CoRR abs/2512.10187 (2025). arXiv:2512.10187 doi:10.48550/ARXIV.2512.10187
-
[5]
Debangshu Banerjee, Olivier Bouissou, and Stefan Zetzsche. 2026. DafnyPro: LLM-Assisted Automated Verification for Dafny Programs.CoRRabs/2601.05385 (2026). arXiv:2601.05385 doi:10.48550/ARXIV.2601.05385
-
[6]
2021.ACSL: ANSI/ISO C Specification Language, Version 1.17
Patrick Baudin, François Bobot, Loïc Correnson, Florent Kirchner, Nikolai Kos- matov, André Maroneze, Virgile Prevosto, and Allan Blanchard. 2021.ACSL: ANSI/ISO C Specification Language, Version 1.17. https://frama-c.com/download/ acsl.pdf
2021
-
[7]
Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu K. Lahiri, and Nikhil Swamy. 2025. Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming. In47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 1755–1767. doi:10.1109/ICSE55347.2...
-
[8]
Saikat Chakraborty, Shuvendu K. Lahiri, Sarah Fakhoury, Akash Lal, Madanlal Musuvathi, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, and Nikhil Swamy. 2023. Ranking LLM-Generated Loop Invariants for Program Verifica- tion. InFindings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 (Findings of ACL), Houda ...
-
[9]
Lahiri, Tao Xie, and Lidong Zhou
Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu K. Lahiri, Tao Xie, and Lidong Zhou. 2025. Automated Proof Generation for Rust Code via Self-Evolution. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025....
2025
-
[10]
Leonardo Mendonça de Moura and Nikolaj S. Bjørner. 2008. Z3: An Efficient SMT Solver. InTools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008. Proceedings (Lecture ...
-
[11]
DeepSeek-AI. 2024. DeepSeek-V3 Technical Report.CoRRabs/2412.19437 (2024). arXiv:2412.19437 doi:10.48550/ARXIV.2412.19437
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437 2024
-
[12]
DeepSeek-AI. 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Lan- guage Models.CoRRabs/2512.02556 (2025). arXiv:2512.02556 doi:10.48550/ ARXIV.2512.02556
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Lorch, Yuan Yao, and Xiaoxing Ma
Nongyu Di, Tianyu Chen, Shan Lu, Shuai Lu, Yeyun Gong, Peng Cheng, Jacob R. Lorch, Yuan Yao, and Xiaoxing Ma. 2026. Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set.CoRRabs/2602.04910 (2026). arXiv:2602.04910 doi:10.48550/ARXIV.2602.04910
work page internal anchor Pith review doi:10.48550/arxiv.2602.04910 2026
-
[14]
Shihan Dou, Yan Liu, Haoxiang Jia, Enyu Zhou, Limao Xiong, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024. StepCoder: Improv- ing Code Generation with Reinforcement Learning from Compiler Feedback. InProceedings of the 62nd Annual Meeting of the Association for C...
-
[15]
Lishui Fan, Yu Zhang, Mouxiang Chen, and Zhongxin Liu. 2025. Posterior-GRPO: Rewarding Reasoning Processes in Code Generation.CoRRabs/2508.05170 (2025). arXiv:2508.05170 doi:10.48550/ARXIV.2508.05170
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.05170 2025
-
[16]
In: Chandra, S., Blincoe, K., Tonella, P
Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole- Proof Generation and Repair with Large Language Models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blinc...
-
[17]
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. 2025. RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025 (Proceedings of Machine Learning Research), Aarti Singh, Maryam Fazel, Da...
2025
-
[18]
John Hatcliff, Gary T. Leavens, K. Rustan M. Leino, Peter Müller, and Matthew J. Parkinson. 2012. Behavioral interface specification languages.ACM Comput. Surv.44, 3 (2012), 16:1–16:58. doi:10.1145/2187671.2187678
-
[19]
Lehan He, Zeren Chen, Zhe Zhang, Jing Shao, Xiang Gao, and Lu Sheng. 2025. Use Property-Based Testing to Bridge LLM Code Generation and Validation. CoRRabs/2506.18315 (2025). arXiv:2506.18315 doi:10.48550/ARXIV.2506.18315
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.18315 2025
-
[20]
C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming.Commun. ACM12, 10 (1969), 576–580. doi:10.1145/363235.363259
-
[21]
Baizhou Huang, Shuai Lu, Xiaojun Wan, and Nan Duan. 2024. Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (E...
-
[22]
Claude Jard and Gregor von Bochmann. 1983. An approach to testing specifica- tions.J. Syst. Softw.3, 4 (1983), 315–323. doi:10.1016/0164-1212(83)90018-3
-
[23]
Manvi Jha, Jiaxin Wan, and Deming Chen. 2025. Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning.CoRR abs/2509.06239 (2025). arXiv:2509.06239 doi:10.48550/ARXIV.2509.06239
-
[24]
Manvi Jha, Lily Jiaxin Wan, Huan Zhang, and Deming Chen. 2025. PREFACE - A Reinforcement Learning Framework for Code Verification via LLM Prompt Repair. InProceedings of the Great Lakes Symposium on VLSI 2025, GLSVLSI 2025, New Orleans, LA, USA, 30 June 2025 - 2 July 2025, Lu Peng, Boris Vaisband, Fan Chen, Peipei Zhou, Shahar Kvatinsky, and Jiafeng Xie (...
-
[25]
Richard A. Kemmerer. 1985. Testing Formal Specifications to Detect Design Errors.IEEE Trans. Software Eng.11, 1 (1985), 32–43. doi:10.1109/TSE.1985.231535
-
[26]
Robert Könighofer, Georg Hofferek, and Roderick Bloem. 2009. Debugging formal specifications using simple counterstrategies. InProceedings of 9th International Conference on Formal Methods in Computer-Aided Design, FMCAD 2009, 15-18 November 2009, Austin, Texas, USA. IEEE, 152–159. doi:10.1109/FMCAD.2009. 5351127
-
[27]
Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages. InFormal Methods in Computer-Aided Design, FMCAD 2024, Prague, Czech Republic, October 15-18, 2024, Nina Narodytska and Philipp Rümmer (Eds.). IEEE, 142–147. doi:10.34727/2024/ISBN.978-3-85448-065- 5_19
-
[28]
Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. 2023. Verus: Verifying Rust Programs using Linear Ghost Types.Proc. ACM Program. Lang.7, OOPSLA1 (2023), 286–315. doi:10.1145/3586037
-
[29]
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu- Hong Hoi. 2022. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. InAdvances in Neural Informa- tion Processing Systems 35: Annual Conference on Neural Information Process- ing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 2...
2022
-
[30]
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning - 16th International Conference, LPAR-16, Dakar, Senegal, April 25-May 1, 2010, Revised Selected Papers (Lecture Notes in Computer Science), Edmund M. Clarke and Andrei Voronkov (Eds.). Springer, 348–37...
-
[31]
Shaoying Liu. 2016. Validating formal specifications using testing-based specifi- cation animation. InProceedings of the 4th FME Workshop on Formal Methods in Software Engineering, FormaliSE@ICSE 2016, Austin, Texas, USA, May 15, 2016. ACM, 29–35. doi:10.1145/2897667.2897668
-
[32]
Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2025. DafnyBench: A Benchmark for Formal Software Verifica- tion.Trans. Mach. Learn. Res.2025 (2025). https://openreview.net/forum?id= yBgTVWccIx
2025
-
[33]
T. L. McCluskey, J. M. Porteous, M. M. West, and C. H. Bryant. 1996. The Validation of Formal Specifications of Requirements.Formal Aspects of Computing 8 (1996), 89–107. doi:10.14236/ewic/FA1996.14
-
[34]
Bertrand Meyer. 1992. Applying "Design by Contract".Computer25, 10 (1992), 40–51. doi:10.1109/2.161279
-
[35]
MiniMax AI. 2025. MiniMax-M2: A Mini Model Built for Max Coding & Agentic Workflows. https://github.com/MiniMax-AI/MiniMax-M2
2025
-
[36]
Martin Mirchev, Andreea Costea, Abhishek Kr Singh, and Abhik Roychoudhury
-
[37]
arXiv:2410.18494 doi:10.48550/ARXIV.2410.18494
Assured Automatic Programming via Large Language Models.CoRR abs/2410.18494 (2024). arXiv:2410.18494 doi:10.48550/ARXIV.2410.18494
- [38]
-
[39]
Ng, Daishi Harada, and Stuart Russell
Andrew Y. Ng, Daishi Harada, and Stuart Russell. 1999. Policy Invariance Un- der Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27 - 30, 1999, Ivan Bratko and Saso Dzeroski (Eds.). Morgan Kaufmann, 278–287
1999
-
[40]
OpenAI. 2025. Introducing GPT-5.1 for developers. https://openai.com/index/gpt- 5-1-for-developers/
2025
-
[41]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Train- ing language models to follow instructions with h...
2022
-
[42]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.CoRRabs/2402.03300 (2024). arXiv:2402.03300 doi:10.48550/ARXIV.2402.03300
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300 2024
-
[43]
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A Flexible and Efficient RLHF Framework. InProceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025. ACM, 1279–1297. doi:10.1145/3689031.3696075
-
[44]
Chuyue Sun, Ying Sheng, Oded Padon, and Clark W. Barrett. 2024. Clover: Closed-Loop Verifiable Code Generation. InAI Verification - First International Symposium, SAIV 2024, Montreal, QC, Canada, July 22-23, 2024, Proceedings (Lec- ture Notes in Computer Science), Guy Avni, Mirco Giacobbe, Taylor T. Johnson, Guy Katz, Anna Lukina, Nina Narodytska, and Chr...
-
[45]
Qwen Team. 2025. Qwen3 Technical Report.CoRRabs/2505.09388 (2025). arXiv:2505.09388 doi:10.48550/ARXIV.2505.09388
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
-
[46]
Amitayush Thakur, Yeming Wen, and Swarat Chaudhuri. 2023. A Language- Agent Approach to Formal Theorem-Proving.CoRRabs/2310.04353 (2023). arXiv:2310.04353 doi:10.48550/ARXIV.2310.04353 11 Zhechong Huang, Zhao Zhang, Zeyu Sun, Huifeng Sun, and Yingfei Xiong
-
[47]
Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2025. RLCoder: Reinforcement Learning for Repository-Level Code Completion. In47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 1140–1152. doi:10. 1109/ICSE55347.2025.00014
-
[48]
Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. 2025. Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning.CoRRabs/2506.03136 (2025). arXiv:2506.03136 doi:10.48550/ARXIV.2506.03136
-
[49]
Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang
-
[50]
arXiv preprint arXiv:2502.18449 , year=
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution.CoRRabs/2502.18449 (2025). arXiv:2502.18449 doi:10.48550/ARXIV.2502.18449
-
[51]
Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchanting Program Specification Synthesis by Large Language Models Using Static Analysis and Program Ver- ification. InComputer Aided Verification - 36th International Conference, CA V 2024, Montreal, QC, Canada, July 24-27, 2024, Proceedi...
-
[52]
Smith, Mari Ostendorf, and Hannaneh Hajishirzi
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Am- manabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2023. Fine- Grained Human Feedback Gives Better Rewards for Language Model Training. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023...
2023
-
[53]
Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan
-
[54]
InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforce- ment Learning and Monte-Carlo Tree Search. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=I4YAIwrsXa
2025
-
[55]
Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, and Jie Fu. 2025. Re:Form - Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny.CoRRabs/2507.16331 (2025). arXiv:2507.16...
-
[56]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2024
-
[57]
Lahiri, Jacob R
Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu K. Lahiri, Jacob R. Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, and Shan Lu. 2025. AutoVerus: Automated Proof Generation for Rust Code.Proc. ACM Program. Lang.9, OOPSLA2 (2025), 3454–
2025
-
[58]
Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J
Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J. Prenger, and Animashree Anandkumar. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Informa- tion Processing Systems 2023, NeurIPS 2023, New Orlean...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.