CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
Pith reviewed 2026-05-12 01:16 UTC · model grok-4.3
The pith
CoCoDA co-evolves a planner with a compositional code DAG so an 8B model matches or exceeds a 32B model on GSM8K and MATH.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoCoDA maintains a single code-native DAG in which primitive tools and composite tools are nodes, invocation dependencies are edges, and each node records a typed signature, description, pre/post-condition specification, and worked examples. At inference, Typed DAG Retrieval uses symbolic unification to discard mismatched candidates early, then applies progressively costlier filters to ever-smaller candidate sets before materializing full context. At training, each successful trajectory is folded into a new validated composite node, and the planner is updated with a DAG-induced reward that credits a composite in proportion to its primitive expansion size. Theoretical arguments establish retrieval-cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness.
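To make the machinery concrete, here is a minimal sketch (not the authors' code) of a typed DAG node and the cheapest-first retrieval cascade; the field names, the toy '?'-variable unifier, and the keyword ranker standing in for description ranking are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ToolNode:
    name: str
    in_types: tuple                 # typed signature: input types, e.g. ("int", "int")
    out_type: str                   # typed signature: output type
    description: str
    pre: str                        # pre-condition specification
    post: str                       # post-condition specification
    examples: list = field(default_factory=list)   # worked (input, output) pairs
    deps: list = field(default_factory=list)       # edges: tools this node invokes

def unifies(query_in, query_out, node):
    """Toy symbolic unification: '?' is a type variable matching anything;
    concrete types must match exactly."""
    if len(query_in) != len(node.in_types):
        return False
    pairs = list(zip(query_in, node.in_types)) + [(query_out, node.out_type)]
    return all(a == "?" or b == "?" or a == b for a, b in pairs)

def keyword_overlap(text, node):
    """Cheap stand-in for the description ranker."""
    return len(set(text.lower().split()) & set(node.description.lower().split()))

def typed_dag_retrieval(query_in, query_out, query_text, library, k=3):
    """Cheapest filter first, so the expensive stages (spec filtering,
    example disambiguation, full-context materialization) only ever see
    a small surviving candidate set."""
    survivors = [n for n in library if unifies(query_in, query_out, n)]
    survivors.sort(key=lambda n: keyword_overlap(query_text, n), reverse=True)
    # Behavioral-spec filtering and example-based disambiguation would run
    # here on the shrunken survivor list before materializing full context.
    return survivors[:k]
```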
What carries the argument
The compositional code DAG, whose nodes are typed executable tools and whose DAG-induced reward credits composites in proportion to their primitive expansion size.
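Read as code, the shaped reward can be illustrated as follows; this sketch reuses the hypothetical ToolNode above, and the proportional credit split is our assumption about one natural reading of "credits a composite in proportion to its primitive expansion size".

```python
def expansion_size(node):
    """Number of primitive tools in the node's transitive expansion;
    a primitive (no dependencies) counts as 1."""
    if not node.deps:
        return 1
    return sum(expansion_size(child) for child in node.deps)

def dag_induced_credit(used_nodes, trajectory_reward):
    """Split a trajectory's reward across invoked tools in proportion
    to their primitive expansion sizes, so larger validated composites
    earn proportionally more credit."""
    sizes = {n.name: expansion_size(n) for n in used_nodes}
    total = sum(sizes.values()) or 1
    return {name: trajectory_reward * s / total for name, s in sizes.items()}
```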
Load-bearing premise
Successful trajectories can be folded into composite tools that preserve correctness and yield net compositional advantage under conservative updates.
What would settle it
An experiment in which the 8B student under CoCoDA scores lower than the 32B teacher on GSM8K or MATH, or in which measured retrieval cost grows linearly with library size.
Original abstract
Tool-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget. Existing tool-use and skill-library methods typically treat tools as flat or text-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code. We propose CoCoDA, a framework that co-evolves the planner and tool library through a single code-native structure: a compositional code DAG. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post-condition specification, and worked examples. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG-induced reward that credits composites by their primitive expansion size. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool-use and library-learning baselines.
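One way to read the claimed retrieval-cost reduction quantitatively (our cost model, not the paper's): if stage \(\ell\) of the cascade charges a per-candidate cost \(c_\ell\) with \(c_1 \le \dots \le c_L\), and \(n_{\ell-1}\) candidates survive into stage \(\ell\), then

```latex
% Illustrative cascade cost model (our notation, not the paper's).
% Stage \ell applies a per-candidate check of cost c_\ell to the
% n_{\ell-1} candidates surviving stage \ell-1, with n_0 the library size.
\[
  C_{\text{cascade}} \;=\; \sum_{\ell=1}^{L} c_{\ell}\, n_{\ell-1}
  \;\ll\; c_{L}\, n_{0} \;=\; C_{\text{flat}}
  \quad\text{whenever } n_{\ell-1} \ll n_{0} \text{ for the stages with large } c_{\ell}.
\]
% Full-context materialization (the c_L term) then touches only the
% final survivor set rather than the whole library.
```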
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoCoDA, a framework that co-evolves a planner and tool library for tool-augmented agents via a single compositional code DAG. Nodes represent primitive or composite tools with typed signatures, descriptions, pre/post-conditions, and examples; edges encode invocation dependencies. Typed DAG Retrieval prunes by symbolic signature unification, ranks by descriptions, filters by behavioral specs, and disambiguates via examples to keep retrieval efficient within fixed context. Training folds successful trajectories into validated composite tools and updates the planner with a DAG-induced reward that credits composites by primitive expansion size. Theoretical results are claimed on retrieval cost reduction, sublinear time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Experiments report that an 8B student model matches or exceeds a 32B teacher on GSM8K and MATH while improving over tool-use and library-learning baselines across math, tabular, and code tasks.
Significance. If the theoretical results and experimental claims hold, this would represent a meaningful advance in scalable tool-augmented agents by addressing library growth through explicit compositional structure and co-evolution rather than flat indexing. The combination of symbolic pruning with behavioral filtering and the shaped reward for composition size could enable smaller models to achieve strong performance via reusable subroutines, with the claimed monotone co-evolution and sublinear retrieval offering practical benefits for long-term library maintenance. The provision of multiple theoretical results is a positive feature, though their impact depends on the rigor of the derivations.
Major comments (1)
- [Abstract] The claim that successful trajectories are 'folded into validated composite tools' that preserve correctness while supporting 'compositional advantage under the shaped reward' and 'monotone co-evolution' is load-bearing for the central performance claims. Trajectory success on one input sequence does not entail semantic correctness or behavioral equivalence for all inputs or future compositions; without explicit mechanisms (e.g., exhaustive testing or formal verification of pre/post-conditions across edge cases in math/code domains), invalid or inefficient composites could enter the DAG and be amplified by retrieval pruning and the expansion-size reward.
Minor comments (2)
- [Abstract] The abstract refers to 'conservative updates' supporting monotone co-evolution but provides no definition or pseudocode for the update rule, which is needed to assess whether the monotonicity claim holds by construction.
- [Abstract] No specific benchmark scores, error bars, number of runs, or data-exclusion criteria are reported for the GSM8K/MATH results or the 8B-vs-32B comparison, hindering assessment of the reliability of the claimed gains over baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, particularly the careful scrutiny of the validation claims in the abstract. We address the concern point by point below and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract] The claim that successful trajectories are 'folded into validated composite tools' that preserve correctness while supporting 'compositional advantage under the shaped reward' and 'monotone co-evolution' is load-bearing for the central performance claims. Trajectory success on one input sequence does not entail semantic correctness or behavioral equivalence for all inputs or future compositions; without explicit mechanisms (e.g., exhaustive testing or formal verification of pre/post-conditions across edge cases in math/code domains), invalid or inefficient composites could enter the DAG and be amplified by retrieval pruning and the expansion-size reward.
Authors: We agree that success on a single trajectory provides only empirical evidence rather than a formal guarantee of semantic correctness across all inputs. In the manuscript, composite folding extracts sub-sequences from successful trajectories and validates them by (i) re-executing the composite on the original inputs to confirm output equivalence, (ii) checking that the declared pre- and post-conditions hold for those inputs, and (iii) verifying consistency with the stored examples. The shaped reward then credits the composite by its primitive expansion size only if the trajectory reward improves, and the conservative update rule (detailed in Section 4.3) ensures that only non-decreasing updates are accepted, supporting the claimed monotone co-evolution. Nevertheless, we acknowledge that this procedure does not include exhaustive testing or formal verification over edge cases, leaving open the theoretical possibility of inefficient or contextually invalid composites entering the DAG. To strengthen the presentation, we will revise the abstract to qualify the term 'validated' and add a new subsection (3.4) that explicitly describes the validation steps, their limitations, and the safeguards provided by the reward shaping and conservative updates. We will also include a brief discussion of potential amplification risks and how retrieval filtering mitigates them in practice.
Revision: partial
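For concreteness, a minimal sketch of the validation-and-acceptance loop this response describes, assuming a composite object that exposes run, pre_holds, post_holds, and a list of (input, output) examples; all interface names and the reward probe are our assumptions, not the manuscript's Section 4.3.

```python
def validate_composite(composite, inputs, expected_outputs):
    """Empirical validation only: re-execute on the original inputs,
    check the declared pre/post-conditions there, and confirm the
    stored examples. Passing says nothing about unseen inputs."""
    for x, y in zip(inputs, expected_outputs):
        if not composite.pre_holds(x):
            return False
        out = composite.run(x)
        if out != y or not composite.post_holds(x, out):
            return False
    return all(composite.run(ex_in) == ex_out
               for ex_in, ex_out in composite.examples)

def conservative_update(library, planner, composite, eval_reward):
    """Accept the candidate composite only if measured reward does not
    decrease, making library quality monotone by construction on the
    probed evaluation distribution."""
    if eval_reward(planner, library + [composite]) >= eval_reward(planner, library):
        return library + [composite]
    return library
```

On this reading, monotonicity holds by construction only on whatever evaluation distribution eval_reward probes, which is exactly the gap the referee flags.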
Circularity Check
No significant circularity; derivation self-contained
Full rationale
The provided abstract and context describe CoCoDA's DAG structure, Typed DAG Retrieval, trajectory folding into composites, and DAG-induced reward without exhibiting any equations or self-citations that reduce claimed theoretical results (retrieval cost reduction, sublinear time, compositional advantage, monotone co-evolution, DAG well-formedness) to inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear. The framework definitions and properties are presented as distinct, with theoretical results positioned as derived consequences rather than tautological. This is the expected honest non-finding when no explicit reduction is quotable.