Draft-Refine-Optimize: Self-Evolved Learning for Natural Language to MongoDB Query Generation
Pith reviewed 2026-05-15 13:12 UTC · model grok-4.3
The pith
A self-evolved framework built on Draft-Refine-Optimize cycles reaches 83.1 percent execution accuracy on an out-of-distribution MongoDB query benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoMQL unifies evidence-grounded context construction with execution-driven learning through iterative Draft-Refine-Optimize cycles. Draft queries trigger query-aware retrieval to build compact evidence that grounds nested paths and resolves ambiguities. The model then undergoes online policy optimization driven by execution-based rewards under curriculum scheduling, and the refined model is fed back into the next cycle to produce progressive improvement.
What carries the argument
The Draft-Refine-Optimize (DRO) cycle that uses draft queries to retrieve evidence contexts and applies execution rewards for policy optimization.
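To make the loop concrete, the following is a minimal illustrative sketch of the cycle as described. The model interface and every helper (`retrieve_evidence`, `execute_mql`, `policy_update`) are hypothetical placeholders under assumed signatures, not the paper's implementation.

```python
# Minimal illustrative sketch of a Draft-Refine-Optimize (DRO) training loop.
# The model interface and all helpers (retrieve_evidence, execute_mql,
# policy_update) are hypothetical placeholders, not the paper's actual API.

def dro_training(model, examples, db, num_cycles=3):
    """examples: list of (nl_question, gold_result) pairs."""
    for cycle in range(num_cycles):
        batch = []
        for question, gold_result in examples:
            # Draft: generate an initial MQL query from the question alone.
            draft = model.generate(question)

            # Refine: the draft drives query-aware retrieval of compact evidence
            # (schema paths, candidate values) used to regenerate a grounded query.
            evidence = retrieve_evidence(draft, question, db)
            refined = model.generate(question, context=evidence)

            # Execution-based reward: 1 if the refined query's result matches gold.
            reward = 1.0 if execute_mql(refined, db) == gold_result else 0.0
            batch.append((question, evidence, refined, reward))

        # Optimize: online policy update from the rewards; curriculum scheduling
        # would control which examples enter the batch at each cycle.
        model = policy_update(model, batch, cycle=cycle)

    # The refined model is fed back into the next cycle, so gains can compound.
    return model
```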
If this is right
- The method outperforms the strongest open-source baselines by up to 9.5 percent on in-distribution tasks and 5.2 percent on out-of-distribution tasks.
- It reaches 76.6 percent execution accuracy on the EAI benchmark and 83.1 percent on the TEND benchmark.
- Only 3 billion activated parameters suffice for the closed-loop improvement process.
- The same paradigm supports scalable, continuous improvement of NL2MQL systems in production settings.
Where Pith is reading between the lines
- The same cycle structure could be tested on other procedural query languages that involve nested operations.
- Production deployments might accumulate gains by running DRO cycles on live user queries without additional labeled data.
- Curriculum scheduling may reduce sensitivity to reward noise compared with static reinforcement learning setups (see the scheduling sketch after this list).
- Limits would appear if execution feedback becomes unreliable for very long pipelines or highly ambiguous value references.
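On the curriculum point above, a minimal sketch of difficulty-based scheduling. The difficulty proxy (aggregation-pipeline length), the field name `gold_pipeline`, and the linearly widening pool are assumptions for illustration; the paper's actual schedule is not specified here.

```python
# Illustrative curriculum scheduler: rank training examples by a simple
# difficulty proxy (number of aggregation-pipeline stages) and expose a wider
# slice of the sorted pool at each DRO cycle. Proxy and schedule are assumed.

def pipeline_depth(example: dict) -> int:
    """Difficulty proxy: number of stages in the gold aggregation pipeline."""
    return len(example["gold_pipeline"])

def curriculum_batches(examples, cycle, num_cycles, batch_size=16):
    """Yield easy-to-hard batches, widening the pool as cycles progress."""
    ranked = sorted(examples, key=pipeline_depth)
    cutoff = int(len(ranked) * (cycle + 1) / num_cycles)  # fraction grows per cycle
    pool = ranked[:max(cutoff, batch_size)]
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]
```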
Load-bearing premise
Execution-based rewards from running queries supply stable and unbiased signals that support reliable policy optimization without instability or overfitting to benchmark patterns.
What would settle it
A new benchmark with deeper nesting or schema distributions outside the EAI and TEND sets on which EvoMQL's accuracy falls below the strongest open-source baselines, or on which training becomes unstable.
Original abstract
Natural Language to MongoDB Query Language (NL2MQL) is essential for democratizing access to modern document-centric databases. Unlike Text-to-SQL, NL2MQL faces unique challenges from MQL's procedural aggregation pipelines, deeply nested schemas, and ambiguous value grounding. Existing approaches use static prompting or one-shot refinement, which inadequately model these complex contexts and fail to systematically leverage execution feedback for persistent improvement. We propose EvoMQL, a self-evolved framework that unifies evidence-grounded context construction with execution-driven learning through iterative Draft-Refine-Optimize (DRO) cycles. Each cycle uses draft queries to trigger query-aware retrieval, dynamically building compact evidence contexts that resolve schema ambiguities and ground nested paths to concrete values. The model then undergoes online policy optimization with execution-based rewards and curriculum scheduling, with refined models feeding back into subsequent cycles for progressive evolution. Overall, EvoMQL achieves state-of-the-art execution accuracy of 76.6% on the EAI in-distribution benchmark and 83.1% on the TEND out-of-distribution benchmark, outperforming the strongest open-source baselines by up to 9.5% and 5.2%, respectively. With only 3B activated parameters, this closed-loop paradigm enables scalable, continuous improvement of NL2MQL systems in production.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EvoMQL, a self-evolved framework for natural language to MongoDB query (NL2MQL) generation. It unifies evidence-grounded context construction via query-aware retrieval with execution-driven online policy optimization inside iterative Draft-Refine-Optimize (DRO) cycles. The model drafts queries, retrieves compact evidence to resolve schema and value ambiguities, refines via execution feedback, and optimizes the policy with curriculum scheduling; refined models feed back into subsequent cycles. The central empirical claim is state-of-the-art execution accuracy of 76.6% on the EAI in-distribution benchmark and 83.1% on the TEND out-of-distribution benchmark, outperforming the strongest open-source baselines by up to 9.5% and 5.2% respectively, using a 3B-parameter model.
Significance. If the reported gains are robust, the work would be significant for NL2MQL because it directly tackles MQL-specific difficulties (deeply nested aggregation pipelines, schema ambiguity, value grounding) through closed-loop execution feedback rather than static prompting. The self-evolution mechanism could enable continuous improvement in production settings. However, the significance is tempered by the absence of ablations isolating the contribution of iterative DRO from simple execution feedback and by the risk that execution rewards overfit to the fixed benchmark distributions rather than learning generalizable MQL generation.
Major comments (3)
- [§5] §5 (Experiments) and §5.3 (Ablation studies): the SOTA claims of 76.6% EAI and 83.1% TEND rest on online policy optimization with execution-based rewards, yet no ablation isolates the effect of self-evolution cycles from one-shot execution feedback; without this, it is impossible to determine whether gains arise from genuine policy improvement or from repeated tuning to the fixed EAI/TEND query patterns.
- [§4.3] §4.3 (Reward design) and §5.4 (Stability analysis): the paper does not report regularization, reward shaping, or diversity controls on the execution reward signal despite MQL's nested pipelines and value-grounding ambiguities; this leaves open the possibility that the observed improvements reflect overfitting rather than stable generalization, especially on the out-of-distribution TEND benchmark.
- [Table 2] Table 2 and §5.1: the reported improvements (up to 9.5% and 5.2%) are given without statistical significance tests, variance across runs, or error analysis broken down by query complexity (e.g., depth of nesting or number of value groundings); these omissions make it difficult to assess whether the gains are reliable or load-bearing for the central claim.
Minor comments (2)
- [§1] The abstract and §1 cite EAI and TEND benchmarks without providing their sizes, construction methodology, or public availability; this should be clarified for reproducibility.
- [§4.2] Notation for the policy optimization objective in §4.2 is introduced without an explicit equation; adding a numbered equation would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to improve the empirical rigor of the work.
Point-by-point responses
- Referee: [§5] §5 (Experiments) and §5.3 (Ablation studies): the SOTA claims of 76.6% EAI and 83.1% TEND rest on online policy optimization with execution-based rewards, yet no ablation isolates the effect of self-evolution cycles from one-shot execution feedback; without this, it is impossible to determine whether gains arise from genuine policy improvement or from repeated tuning to the fixed EAI/TEND query patterns.
Authors: We agree that a direct ablation separating iterative self-evolution from one-shot execution feedback is necessary to substantiate the contribution of the DRO cycles. The manuscript presents the full iterative framework but does not include this specific comparison. In the revised version we will add an ablation study contrasting the complete multi-cycle EvoMQL against a single-cycle baseline that applies execution feedback only once. This will clarify whether observed gains derive from progressive policy improvement or from repeated exposure to the same benchmark distributions. revision: yes
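A minimal sketch of the proposed ablation, reusing the illustrative `dro_training` helper sketched earlier; `evaluate_accuracy` is a hypothetical evaluation function, not the paper's harness.

```python
# Illustrative ablation: full multi-cycle self-evolution vs. a single cycle of
# execution feedback, evaluated on a held-out set. evaluate_accuracy is assumed.

def ablate_cycles(base_model, train_examples, eval_set, db, max_cycles=3):
    scores = {}
    for n in (1, max_cycles):  # single-cycle baseline vs. full multi-cycle run
        model_n = dro_training(base_model, train_examples, db, num_cycles=n)
        scores[n] = evaluate_accuracy(model_n, eval_set, db)
    return scores  # e.g. {1: acc_single_cycle, 3: acc_multi_cycle}
```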
- Referee: [§4.3] §4.3 (Reward design) and §5.4 (Stability analysis): the paper does not report regularization, reward shaping, or diversity controls on the execution reward signal despite MQL's nested pipelines and value-grounding ambiguities; this leaves open the possibility that the observed improvements reflect overfitting rather than stable generalization, especially on the out-of-distribution TEND benchmark.
Authors: The reward in §4.3 is defined as binary execution success with syntax-error penalties. The policy optimization objective already includes a KL-divergence term for regularization, yet we did not explicitly discuss reward shaping or diversity controls. We will expand §4.3 to detail these mechanisms and augment §5.4 with diversity metrics (e.g., query-structure entropy) and additional experiments applying reward shaping to address value-grounding ambiguities. These additions will better demonstrate stability on the TEND out-of-distribution set. revision: yes
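A minimal sketch of the reward described above, assuming the generated query has already been parsed into a PyMongo-style pipeline (a list of stage documents); the penalty value, the helper names, and the treatment of invalid pipelines as the "syntax error" case are illustrative assumptions.

```python
import json
from pymongo.collection import Collection
from pymongo.errors import OperationFailure

def results_match(predicted, gold) -> bool:
    """Order-insensitive comparison of result documents (illustrative)."""
    def canon(docs):
        return sorted(json.dumps(d, sort_keys=True, default=str) for d in docs)
    return canon(list(predicted)) == canon(list(gold))

def execution_reward(pipeline, gold_result, collection: Collection,
                     syntax_penalty: float = -0.5) -> float:
    """Binary execution success with a penalty for invalid or malformed pipelines."""
    try:
        predicted = collection.aggregate(pipeline)  # standard PyMongo call
    except OperationFailure:
        return syntax_penalty   # server rejects the pipeline: the "syntax error" case
    except Exception:
        return 0.0              # other execution failures earn no reward
    return 1.0 if results_match(predicted, gold_result) else 0.0

# The optimization objective then maximizes expected reward minus a KL term that
# keeps the updated policy near a reference model:
#   J(theta) = E[execution_reward] - beta * KL(pi_theta || pi_ref)
```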
- Referee: [Table 2] Table 2 and §5.1: the reported improvements (up to 9.5% and 5.2%) are given without statistical significance tests, variance across runs, or error analysis broken down by query complexity (e.g., depth of nesting or number of value groundings); these omissions make it difficult to assess whether the gains are reliable or load-bearing for the central claim.
Authors: We acknowledge that the current presentation lacks statistical tests, run-to-run variance, and complexity-stratified error analysis. In the revision we will re-run all experiments with multiple random seeds, report means and standard deviations, and include paired statistical significance tests. We will also add a dedicated error-analysis subsection in §5.1 that breaks down failures according to nesting depth and number of value groundings, thereby providing a clearer assessment of where the gains are most reliable. revision: yes
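A minimal sketch of the planned analysis, assuming per-seed records with a gold nesting depth and a correctness flag; bucket edges, field names, and the choice of a paired t-test are illustrative assumptions.

```python
# Illustrative complexity-stratified evaluation: per-seed execution accuracy
# bucketed by gold nesting depth, with mean/std across seeds and a paired test
# against a baseline. Bucket edges and record fields are assumed.

from collections import defaultdict
import statistics
from scipy.stats import ttest_rel  # paired significance test over seeds

def depth_bucket(depth: int) -> str:
    return "1-2" if depth <= 2 else "3-4" if depth <= 4 else "5+"

def stratified_accuracy(per_seed_records):
    """per_seed_records: one list per seed of {"depth": int, "correct": bool}."""
    bucket_accs = defaultdict(list)
    for records in per_seed_records:
        by_bucket = defaultdict(list)
        for rec in records:
            by_bucket[depth_bucket(rec["depth"])].append(rec["correct"])
        for bucket, flags in by_bucket.items():
            bucket_accs[bucket].append(sum(flags) / len(flags))
    return {b: (statistics.mean(a), statistics.stdev(a) if len(a) > 1 else 0.0)
            for b, a in bucket_accs.items()}

def paired_test(system_acc_per_seed, baseline_acc_per_seed):
    """Paired t-test over per-seed overall accuracies of two systems."""
    return ttest_rel(system_acc_per_seed, baseline_acc_per_seed)
```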
Circularity Check
No significant circularity; derivation relies on external execution feedback
Full rationale
The paper presents EvoMQL as an iterative framework that constructs evidence contexts from draft queries and applies online policy optimization using execution-based rewards from running the generated MongoDB queries. This chain depends on external signals (query execution results) rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, uniqueness theorems, or ansatzes are shown that reduce the claimed accuracy gains to the inputs by construction. The reported SOTA numbers are framed as empirical outcomes of the closed-loop process, not tautological restatements of the method itself.