CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
Pith reviewed 2026-05-12 01:16 UTC · model grok-4.3
The pith
CoCoDA co-evolves a planner with a compositional code DAG so an 8B model matches or exceeds a 32B model on GSM8K and MATH.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoCoDA maintains a single code-native DAG in which primitive tools and composite tools are nodes, invocation dependencies are edges, and each node records a typed signature, description, pre/post-condition specification, and worked examples. At inference, Typed DAG Retrieval uses symbolic unification to discard mismatched candidates early, then applies progressively costlier filters to ever-smaller candidate sets before materializing full context. At training, each successful trajectory is folded into a new validated composite node, and the planner is updated with a DAG-induced reward that credits a composite in proportion to its primitive expansion size. Theoretical arguments establish retrieval-cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness.
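To make the machinery concrete, here is a minimal sketch (not the authors' code) of a typed DAG node and the cheapest-first retrieval cascade; the field names, the toy '?'-variable unifier, and the keyword ranker standing in for description ranking are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ToolNode:
    name: str
    in_types: tuple                 # typed signature: input types, e.g. ("int", "int")
    out_type: str                   # typed signature: output type
    description: str
    pre: str                        # pre-condition specification
    post: str                       # post-condition specification
    examples: list = field(default_factory=list)   # worked (input, output) pairs
    deps: list = field(default_factory=list)       # edges: tools this node invokes

def unifies(query_in, query_out, node):
    """Toy symbolic unification: '?' is a type variable matching anything;
    concrete types must match exactly."""
    if len(query_in) != len(node.in_types):
        return False
    pairs = list(zip(query_in, node.in_types)) + [(query_out, node.out_type)]
    return all(a == "?" or b == "?" or a == b for a, b in pairs)

def keyword_overlap(text, node):
    """Cheap stand-in for the description ranker."""
    return len(set(text.lower().split()) & set(node.description.lower().split()))

def typed_dag_retrieval(query_in, query_out, query_text, library, k=3):
    """Cheapest filter first, so the expensive stages (spec filtering,
    example disambiguation, full-context materialization) only ever see
    a small surviving candidate set."""
    survivors = [n for n in library if unifies(query_in, query_out, n)]
    survivors.sort(key=lambda n: keyword_overlap(query_text, n), reverse=True)
    # Behavioral-spec filtering and example-based disambiguation would run
    # here on the shrunken survivor list before materializing full context.
    return survivors[:k]
```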
What carries the argument
The compositional code DAG, whose nodes are typed executable tools and whose DAG-induced reward credits composites in proportion to their primitive expansion size.
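Read as code, the shaped reward can be illustrated as follows; this sketch reuses the hypothetical ToolNode above, and the proportional credit split is our assumption about one natural reading of "credits a composite in proportion to its primitive expansion size".

```python
def expansion_size(node):
    """Number of primitive tools in the node's transitive expansion;
    a primitive (no dependencies) counts as 1."""
    if not node.deps:
        return 1
    return sum(expansion_size(child) for child in node.deps)

def dag_induced_credit(used_nodes, trajectory_reward):
    """Split a trajectory's reward across invoked tools in proportion
    to their primitive expansion sizes, so larger validated composites
    earn proportionally more credit."""
    sizes = {n.name: expansion_size(n) for n in used_nodes}
    total = sum(sizes.values()) or 1
    return {name: trajectory_reward * s / total for name, s in sizes.items()}
```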
Load-bearing premise
Successful trajectories can be folded into composite tools that preserve correctness and yield net compositional advantage under conservative updates.
What would settle it
An experiment in which the 8B student under CoCoDA scores lower than the 32B teacher on GSM8K or MATH, or in which measured retrieval cost grows linearly with library size.
Original abstract
Tool-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget. Existing tool-use and skill-library methods typically treat tools as flat or text-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code. We propose CoCoDA, a framework that co-evolves the planner and tool library through a single code-native structure: a compositional code DAG. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post-condition specification, and worked examples. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG-induced reward that credits composites by their primitive expansion size. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool-use and library-learning baselines.
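One way to read the claimed retrieval-cost reduction quantitatively (our cost model, not the paper's): if stage \(\ell\) of the cascade charges a per-candidate cost \(c_\ell\) with \(c_1 \le \dots \le c_L\), and \(n_{\ell-1}\) candidates survive into stage \(\ell\), then

```latex
% Illustrative cascade cost model (our notation, not the paper's).
% Stage \ell applies a per-candidate check of cost c_\ell to the
% n_{\ell-1} candidates surviving stage \ell-1, with n_0 the library size.
\[
  C_{\text{cascade}} \;=\; \sum_{\ell=1}^{L} c_{\ell}\, n_{\ell-1}
  \;\ll\; c_{L}\, n_{0} \;=\; C_{\text{flat}}
  \quad\text{whenever } n_{\ell-1} \ll n_{0} \text{ for the stages with large } c_{\ell}.
\]
% Full-context materialization (the c_L term) then touches only the
% final survivor set rather than the whole library.
```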
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoCoDA, a framework that co-evolves a planner and tool library for tool-augmented agents via a single compositional code DAG. Nodes represent primitive or composite tools with typed signatures, descriptions, pre/post-conditions, and examples; edges encode invocation dependencies. Typed DAG Retrieval prunes by symbolic signature unification, ranks by descriptions, filters by behavioral specs, and disambiguates via examples to keep retrieval efficient within fixed context. Training folds successful trajectories into validated composite tools and updates the planner with a DAG-induced reward that credits composites by primitive expansion size. Theoretical results are claimed on retrieval cost reduction, sublinear time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Experiments report that an 8B student model matches or exceeds a 32B teacher on GSM8K and MATH while improving over tool-use and library-learning baselines across math, tabular, and code tasks.
Significance. If the theoretical results and experimental claims hold, this would represent a meaningful advance in scalable tool-augmented agents by addressing library growth through explicit compositional structure and co-evolution rather than flat indexing. The combination of symbolic pruning with behavioral filtering and the shaped reward for composition size could enable smaller models to achieve strong performance via reusable subroutines, with the claimed monotone co-evolution and sublinear retrieval offering practical benefits for long-term library maintenance. The provision of multiple theoretical results is a positive feature, though their impact depends on the rigor of the derivations.
Major comments (1)
- [Abstract] The claim that successful trajectories are 'folded into validated composite tools' that preserve correctness while supporting 'compositional advantage under the shaped reward' and 'monotone co-evolution' is load-bearing for the central performance claims. Trajectory success on one input sequence does not entail semantic correctness or behavioral equivalence for all inputs or future compositions; without explicit mechanisms (e.g., exhaustive testing or formal verification of pre/post-conditions across edge cases in math/code domains), invalid or inefficient composites could enter the DAG and be amplified by retrieval pruning and the expansion-size reward.
Minor comments (2)
- [Abstract] The abstract refers to 'conservative updates' supporting monotone co-evolution but provides no definition or pseudocode for the update rule, which is needed to assess whether the monotonicity claim holds by construction.
- [Abstract] No specific benchmark scores, error bars, number of runs, or data-exclusion criteria are reported for the GSM8K/MATH results or the 8B-vs-32B comparison, hindering assessment of the reliability of the claimed gains over baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, particularly the careful scrutiny of the validation claims in the abstract. We address the concern point by point below and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract] The claim that successful trajectories are 'folded into validated composite tools' that preserve correctness while supporting 'compositional advantage under the shaped reward' and 'monotone co-evolution' is load-bearing for the central performance claims. Trajectory success on one input sequence does not entail semantic correctness or behavioral equivalence for all inputs or future compositions; without explicit mechanisms (e.g., exhaustive testing or formal verification of pre/post-conditions across edge cases in math/code domains), invalid or inefficient composites could enter the DAG and be amplified by retrieval pruning and the expansion-size reward.
Authors: We agree that success on a single trajectory provides only empirical evidence rather than a formal guarantee of semantic correctness across all inputs. In the manuscript, composite folding extracts sub-sequences from successful trajectories and validates them by (i) re-executing the composite on the original inputs to confirm output equivalence, (ii) checking that the declared pre- and post-conditions hold for those inputs, and (iii) verifying consistency with the stored examples. The shaped reward then credits the composite by its primitive expansion size only if the trajectory reward improves, and the conservative update rule (detailed in Section 4.3) ensures that only non-decreasing updates are accepted, supporting the claimed monotone co-evolution. Nevertheless, we acknowledge that this procedure does not include exhaustive testing or formal verification over edge cases, leaving open the theoretical possibility of inefficient or contextually invalid composites entering the DAG. To strengthen the presentation, we will revise the abstract to qualify the term 'validated' and add a new subsection (3.4) that explicitly describes the validation steps, their limitations, and the safeguards provided by the reward shaping and conservative updates. We will also include a brief discussion of potential amplification risks and how retrieval filtering mitigates them in practice.
Revision: partial
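For concreteness, a minimal sketch of the validation-and-acceptance loop this response describes, assuming a composite object that exposes run, pre_holds, post_holds, and a list of (input, output) examples; all interface names and the reward probe are our assumptions, not the manuscript's Section 4.3.

```python
def validate_composite(composite, inputs, expected_outputs):
    """Empirical validation only: re-execute on the original inputs,
    check the declared pre/post-conditions there, and confirm the
    stored examples. Passing says nothing about unseen inputs."""
    for x, y in zip(inputs, expected_outputs):
        if not composite.pre_holds(x):
            return False
        out = composite.run(x)
        if out != y or not composite.post_holds(x, out):
            return False
    return all(composite.run(ex_in) == ex_out
               for ex_in, ex_out in composite.examples)

def conservative_update(library, planner, composite, eval_reward):
    """Accept the candidate composite only if measured reward does not
    decrease, making library quality monotone by construction on the
    probed evaluation distribution."""
    if eval_reward(planner, library + [composite]) >= eval_reward(planner, library):
        return library + [composite]
    return library
```

On this reading, monotonicity holds by construction only on whatever evaluation distribution eval_reward probes, which is exactly the gap the referee flags.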
Circularity Check
No significant circularity; derivation self-contained
Full rationale
The provided abstract and context describe CoCoDA's DAG structure, Typed DAG Retrieval, trajectory folding into composites, and DAG-induced reward without exhibiting any equations or self-citations that reduce claimed theoretical results (retrieval cost reduction, sublinear time, compositional advantage, monotone co-evolution, DAG well-formedness) to inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear. The framework definitions and properties are presented as distinct, with theoretical results positioned as derived consequences rather than tautological. This is the expected honest non-finding when no explicit reduction is quotable.