pith. machine review for the scientific record.

arxiv: 2604.18254 · v1 · submitted 2026-04-20 · 💻 cs.AI · cs.DB · cs.SE

Recognition: unknown

LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL

Ismail Berrada, Saad Ezzini, Salmane Chafik

Pith reviewed 2026-05-10 03:59 UTC · model grok-4.3

classification 💻 cs.AI · cs.DB · cs.SE

keywords curriculum learning · adapter composition · Text-to-SQL · code generation · large language models · modular training · Spider benchmark · BIRD benchmark

The pith

Sequentially training tier-specific adapters on increasing query complexity improves complex Text-to-SQL performance over standard fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether curriculum learning helps large language models translate natural language into complex SQL code. Simply sorting training examples by difficulty in one pass does not work because earlier skills are forgotten. Instead, the authors train separate adapter modules one after another on easy, medium, hard, and extra-hard examples, then combine the modules when answering new questions. This modular setup yields better results on the hardest queries in the Spider and BIRD benchmarks while letting users pick only the adapters needed for a given database schema. The approach matters because real databases often contain noisy structures and deeply nested queries that defeat current monolithic models.

Core claim

By sequentially training tier-specific adapters on incremental complexity levels (Easy to Extra-Hard), the Modular Adapter Composition strategy creates a scaffolded learning environment that improves performance on complex queries, delivers measurable gains on Spider and BIRD, and supplies a flexible Lego-like architecture for composing models according to schema difficulty requirements.

What carries the argument

Modular Adapter Composition (MAC), a method of training separate adapters sequentially on complexity tiers and combining them at inference time to handle varying query difficulties without catastrophic forgetting.
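The training side of that procedure can be sketched as below. This is a toy illustration, not the paper's implementation: `train_adapter` stands in for LoRA fitting on an LLM, and the dataset format is assumed.

```python
# Toy sketch of MAC-style sequential training (illustrative only;
# the paper's actual implementation fits LoRA adapters on an LLM).

TIERS = ["easy", "medium", "hard", "extra_hard"]

def train_adapter(base_model, examples):
    # Stand-in for LoRA fitting: the "adapter" just records the
    # tier it was trained on and how many examples it saw.
    return {"tier": examples[0]["tier"], "n_seen": len(examples)}

def train_mac(base_model, dataset):
    """Train one fresh adapter per complexity tier, in ascending
    order. The base model and all earlier adapters stay frozen,
    so easy-tier skills cannot be overwritten by later tiers;
    that is the forgetting failure mode of the naive single-pass
    curriculum that this setup is meant to avoid."""
    adapters = {}
    for tier in TIERS:
        tier_examples = [ex for ex in dataset if ex["tier"] == tier]
        if tier_examples:
            adapters[tier] = train_adapter(base_model, tier_examples)
    return adapters
```

The key structural point is that each tier gets its own fresh module rather than continuing to update one shared set of weights.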

Load-bearing premise

The separately trained adapters on different complexity tiers can be combined at inference time without interference or loss of earlier capabilities.
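A minimal sketch of what such interference-free combination could look like, assuming each adapter contributes an additive weight delta and tiers are selected by estimated schema difficulty; the review does not specify the paper's actual composition operator, so both the selection rule and the additive merge here are assumptions.

```python
import numpy as np

TIER_ORDER = ["easy", "medium", "hard", "extra_hard"]

def select_tiers(schema_difficulty):
    # Hypothetical schema-aware selection rule: include every tier
    # up to the estimated difficulty of the target schema.
    return TIER_ORDER[: TIER_ORDER.index(schema_difficulty) + 1]

def compose(base_weights, adapter_deltas, schema_difficulty):
    """Selection followed by additive merging: sum the weight
    deltas of the selected tier adapters onto a copy of the
    frozen base weights. The base itself is never modified,
    which is what lets different compositions coexist."""
    merged = base_weights.copy()
    for tier in select_tiers(schema_difficulty):
        merged += adapter_deltas[tier]
    return merged
```

Whether real LoRA deltas actually combine this cleanly, without the interference the premise rules out, is exactly what the referee asks the ablations to show.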

What would settle it

If a model assembled from the tier-specific adapters performs no better than, or worse than, a single model fine-tuned on the full mixed dataset on complex queries, the modular curriculum claim would not hold.

Figures

Figures reproduced from arXiv: 2604.18254 by Ismail Berrada, Saad Ezzini, Salmane Chafik.

Figure 1. Multi-adapter curriculum fine-tuning. view at source ↗
Figure 2. LeGo-Code during inference. view at source ↗
Figure 3. Execution accuracy comparison of various adapters. view at source ↗
read the original abstract

Recently, code-oriented large language models (LLMs) have demonstrated strong capabilities in translating natural language into executable code. Text-to-SQL is a significant application of this ability, enabling non-technical users to interact with relational databases using natural language. However, state-of-the-art models continue to struggle with highly complex logic, particularly deeply nested statements involving multiple joins and conditions, as well as with real-world database schemas that are noisy or poorly structured. In this paper, we investigate whether curriculum learning can improve the performance of code-based LLMs on Text-to-SQL tasks. Employing benchmarks including Spider and BIRD, we fine-tune models under different curriculum strategies. Our experiments show that naive curriculum, simply ordering training samples by complexity in a single epoch, fails to surpass standard fine-tuning due to catastrophic forgetting. To overcome this, we propose a Modular Adapter Composition (MAC) strategy. By sequentially training tier-specific adapters on incremental complexity levels (Easy to Extra-Hard), we create a scaffolded learning environment that improves performance on complex queries. Our approach not only produces measurable performance gains on the Spider and BIRD benchmarks but also provides a flexible, "Lego-like" architecture, allowing models to be composed and deployed based on specific schema difficulty requirements. These findings demonstrate that structured, modular learning is a superior alternative to monolithic fine-tuning for mastering the syntax and logic of complex code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper investigates whether curriculum learning can improve code LLMs on Text-to-SQL tasks using Spider and BIRD benchmarks. It reports that naive curriculum (ordering samples by complexity in one epoch) fails to beat standard fine-tuning due to catastrophic forgetting. It proposes Modular Adapter Composition (MAC): sequentially training tier-specific adapters on incremental complexity levels (Easy to Extra-Hard) to create a scaffolded environment that yields measurable gains on complex queries while enabling flexible 'Lego-like' composition at inference based on schema difficulty.

Significance. If the empirical claims hold with proper controls, MAC could demonstrate a practical modular alternative to monolithic fine-tuning for mastering complex code generation, potentially mitigating forgetting while allowing deployment flexibility. The approach aligns with adapter-based efficiency trends and could generalize to other structured generation tasks if the composability benefit is isolated.

major comments (4)
  1. [Abstract] Abstract: the claim of 'measurable performance gains' on Spider and BIRD is unsupported by any reported metrics, baselines, deltas, statistical tests, or exact numbers; without these the central empirical claim cannot be evaluated.
  2. [Method] Method (MAC description): the central claim that tier-specific adapters are composable and that their combination improves complex-query performance requires an explicit definition of the composition operator (additive merging, gating, selection, etc.) and an ablation comparing the full MAC composition against the Extra-Hard adapter alone; neither is described.
  3. [Experiments] Experiments: no evidence is provided that lower-tier adapters contribute positively (rather than neutrally or negatively) when combined on Extra-Hard examples, nor is there a comparison to monolithic fine-tuning on the full dataset; this leaves open the possibility that gains are attributable solely to the final tier.
  4. [Abstract] Abstract / § on tier definition: complexity tiers (Easy to Extra-Hard) are central to the curriculum but no quantitative criteria, heuristics, or measurement procedure for assigning samples to tiers is stated, undermining reproducibility and the claim of structured scaffolding.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'Lego-like' architecture is evocative but should be replaced or supplemented by a precise description of the inference-time composition rule.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for improving the clarity, rigor, and reproducibility of our work. We have revised the manuscript to fully address each major comment by adding the requested quantitative results, methodological definitions, ablations, and tier criteria.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'measurable performance gains' on Spider and BIRD is unsupported by any reported metrics, baselines, deltas, statistical tests, or exact numbers; without these the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract must be self-contained and quantitatively supported. In the revised manuscript, we have updated the abstract to explicitly report key metrics, including exact accuracy improvements (with deltas) on complex queries for MAC versus standard fine-tuning and naive curriculum on both Spider and BIRD. We also reference the full experimental tables and note the use of statistical tests (e.g., paired significance tests) to substantiate the gains. revision: yes

  2. Referee: [Method] Method (MAC description): the central claim that tier-specific adapters are composable and that their combination improves complex-query performance requires an explicit definition of the composition operator (additive merging, gating, selection, etc.) and an ablation comparing the full MAC composition against the Extra-Hard adapter alone; neither is described.

    Authors: We thank the referee for this observation. The revised Method section now provides an explicit definition of the composition operator as a dynamic, schema-aware selection followed by additive merging of the relevant tier adapters at inference time. We have also added a dedicated ablation study directly comparing full MAC composition against the Extra-Hard adapter alone, confirming that the modular combination improves performance on complex queries. revision: yes

  3. Referee: [Experiments] Experiments: no evidence is provided that lower-tier adapters contribute positively (rather than neutrally or negatively) when combined on Extra-Hard examples, nor is there a comparison to monolithic fine-tuning on the full dataset; this leaves open the possibility that gains are attributable solely to the final tier.

    Authors: We have incorporated new experimental results to address this directly. The revised Experiments section includes controlled ablations demonstrating that lower-tier adapters contribute positively (with measurable gains) when composed for Extra-Hard examples. We have also added a head-to-head comparison against monolithic fine-tuning on the full dataset, showing MAC's advantages on complex queries while mitigating forgetting. revision: yes

  4. Referee: [Abstract] Abstract / § on tier definition: complexity tiers (Easy to Extra-Hard) are central to the curriculum but no quantitative criteria, heuristics, or measurement procedure for assigning samples to tiers is stated, undermining reproducibility and the claim of structured scaffolding.

    Authors: We apologize for the missing details on tier assignment. The revised manuscript adds a new subsection in Methods that specifies the quantitative criteria and procedure: tiers are assigned using a combination of query features (number of joins, nesting depth, number of conditions) and schema features (number of tables, foreign-key density), with explicit thresholds and heuristics derived from parse-tree analysis. This is now fully described for both Spider and BIRD to ensure reproducibility. revision: yes
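The tier-assignment procedure the rebuttal describes might look roughly like this in code. The string heuristics and the numeric thresholds below are assumptions for illustration; the paper's actual criteria rely on parse-tree analysis and stated cutoffs not reproduced in this review.

```python
def query_features(sql):
    # Crude string heuristics standing in for parse-tree analysis;
    # good enough to illustrate tiering, not for production use.
    s = " ".join(sql.upper().split())
    return {
        "joins": s.count(" JOIN "),
        "nesting": s.count("(SELECT"),
        "conditions": s.count(" AND ") + s.count(" OR ")
                      + (1 if " WHERE " in s else 0),
    }

def assign_tier(sql):
    # Hypothetical thresholds on a weighted complexity score;
    # nesting is weighted more heavily than joins or conditions.
    f = query_features(sql)
    score = f["joins"] + 2 * f["nesting"] + f["conditions"]
    if score == 0:
        return "easy"
    if score <= 2:
        return "medium"
    if score <= 4:
        return "hard"
    return "extra_hard"
```

However the cutoffs are chosen, they are free parameters of the method, which is why the ledger below flags them.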

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external benchmarks

full rationale

The paper introduces a procedural training strategy (sequential tier-specific LoRA adapters) and reports its performance on standard held-out benchmarks (Spider, BIRD). No equations, fitted parameters, or self-citations are used to derive the claimed gains; improvements are measured directly via execution accuracy on test queries after training. The central claim does not reduce by construction to quantities defined from the same data used to label complexity tiers, nor does any load-bearing step rely on prior self-citations that themselves assume the result. The derivation chain is therefore self-contained against external evaluation.
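For reference, the execution-accuracy metric this rationale leans on can be sketched with `sqlite3`; the official Spider and BIRD evaluators add value normalization and ordering rules this sketch omits, so treat it as a minimal stand-in.

```python
import sqlite3

def execution_accuracy(pairs, setup_sql):
    """Fraction of (predicted, gold) SQL pairs whose result sets
    match when executed against the same database. A prediction
    that fails to execute counts as wrong. Rows are compared as
    sorted multisets, ignoring ORDER BY semantics."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)

    def rows(sql):
        try:
            return sorted(conn.execute(sql).fetchall())
        except sqlite3.Error:
            return None  # unexecutable query

    correct = 0
    for pred, gold in pairs:
        r = rows(pred)
        if r is not None and r == rows(gold):
            correct += 1
    conn.close()
    return correct / len(pairs)
```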

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim depends on the empirical effectiveness of the newly introduced MAC procedure and the assumption that complexity tiers can be defined in a way that enables successful sequential adapter training.

free parameters (1)
  • Complexity tier thresholds
    Definition of Easy/Medium/Hard/Extra-Hard query levels must involve some criteria or cutoffs chosen by the authors.
axioms (1)
  • domain assumption Naive single-pass curriculum ordering causes catastrophic forgetting in LLM fine-tuning for Text-to-SQL.
    Invoked to explain why standard curriculum fails and to motivate the modular approach.
invented entities (1)
  • Tier-specific adapters (no independent evidence)
    purpose: Specialized modules trained on incremental complexity levels that can be composed at inference time.
    New component introduced by the MAC strategy; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5556 in / 1324 out tokens · 49566 ms · 2026-05-10T03:59:46.781826+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1] Anthropic. 2026. Claude Code. https://claude.com/product/claude-code
  2. [2] Anysphere. 2026. Cursor: The best way to code with AI. https://cursor.com
  3. [3] Daria Bakshandaeva, Oleg Somov, Ekaterina Dmitrieva, Vera Davydova, and Elena Tutubalina. 2022. PAUQ: Text-to-SQL in Russian. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, 2355–2376.
  4. [4] Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy, 4560–4565.
  5. [5] Ruichu Cai, Jinjie Yuan, Boyan Xu, and Zhifeng Hao. 2021. SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL. In Advances in Neural Information Processing Systems, Vol. 34. 7664–7676.
  6. [6] Salmane Chafik, Saad Ezzini, and Ismail Berrada. 2025. Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4). 86–92.
  7. [7] Salmane Chafik, Saad Ezzini, and Ismail Berrada. 2025. Towards Automating Domain-Specific Data Generation for Text-to-SQL: A Comprehensive Approach. ACM Transactions on Software Engineering and Methodology (2025).
  8. [8] Salmane Chafik, Saad Ezzini, and Ismail Berrada. 2026. DarijaDB: Unlocking Text-to-SQL for Arabic Dialects. ACM Transactions on Asian and Low-Resource Language Information Processing (2026).
  9. [9] Li Dong and Mirella Lapata. 2018. Coarse-to-Fine Decoding for Neural Semantic Parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 731–742.
  10. [10] Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, and Jian-Guang Lou. 2022. MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing. arXiv:2212.13492 (Dec. 2022).
  11. [11] Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving Text-to-SQL Evaluation Methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 351–360.
  12. [12] Jiaqi Guo, Ziliang Si, Yu Wang, Qian Liu, Ming Fan, Jian-Guang Lou, Zijiang Yang, and Ting Liu. 2021. Chase: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language ...
  13. [13] Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy, 4524–4535.
  14. [14] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In International Conference on Machine Learning. PMLR, 2790–2799.
  15. [15] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186 (2024).
  16. [16] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C. C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. arXiv:2305.03111 (2023).
  17. [17] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv preprint arXiv:2402.19173 (2024).
  18. [18] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).
  19. [19] Ohad Rubin and Jonathan Berant. 2021. SmBoP: Semi-autoregressive Bottom-up Semantic Parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online, 311–324.
  20. [20] Yewei Song, Saad Ezzini, Xunzhu Tang, Cedric Lothritz, Jacques Klein, Tegawendé Bissyandé, Andrey Boytsov, Ulrick Ble, and Anne Goujon. 2024. Enhancing Text-to-SQL Translation for Financial System Design. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 252–262.
  21. [21] Ruoxi Sun, Sercan Arik, Rajarishi Sinha, Hootan Nakhost, Hanjun Dai, Pengcheng Yin, and Tomas Pfister. 2023. SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data. In Findings of the Association for Computational Linguistics: EMNLP.
  22. [22] Chang-Yu Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. 2023. Exploring Chain-of-Thought Style Prompting for Text-to-SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 5376–5393.
  23. [23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. Advances in Neural Information Processing Systems 30 (2017).
  24. [24] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online, 7567–7578.
  25. [25] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2019. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887 (2019).
  26. [26] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv:1709.00103 (2017).
  27. [27] Xiaohu Zhu, Qian Li, Lizhen Cui, and Yuntao Du. 2025. Learning SQL Like a Human: Structure-Aware Curriculum Learning for Text-to-SQL Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025. 3545–3559.