pith. sign in

arxiv: 2606.24446 · v1 · pith:LN4OOVWVnew · submitted 2026-06-23 · 💻 cs.SE

Agentic Generation of AST Transformation Rules for Fixing Breaking Updates

Pith reviewed 2026-06-25 22:53 UTC · model grok-4.3

classification 💻 cs.SE
keywords breaking changesAPI evolutionprogram repairAST transformationslarge language modelsdependency updatessoftware maintenance
0
0 comments X

The pith

An agentic framework generates reusable AST transformations that fix breaking library updates across projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern software relies on evolving third-party libraries that introduce breaking API changes, forcing repeated repairs for each affected project. BigBag uses large language models in an agentic process to produce structured, executable AST transformation programs that encode repair logic at the API level. These programs are evaluated on 157 breaking dependency updates from the BUMP benchmark using four LLMs and two AST engines. The strongest setup yields 94.3 percent compilable transformations and 78.6 percent fixes, with generated rules transferring to other projects at 33.3 percent overall and 80 percent or higher when API usage is uniform across clients.

Core claim

BigBag generates fixing transformations as executable programs that encode repair logic at the API level and transfer to any client broken by the same update, achieving 94.3 percent compilable rate and 78.6 percent fix rate on 157 BUMP cases with cross-project transfer of 33.3 percent overall.

What carries the argument

BigBag, an agentic framework that combines large language models with AST transformation engines (Spoon and JavaParser) to produce reusable fixing programs for breaking dependency updates.

If this is right

  • Repair logic can be written once per breaking update and applied to every client project affected by that update.
  • Projects can adopt library updates without each one requiring separate patch generation.
  • Fixes become more reliable when all clients invoke the changed API element in the same way.
  • The method applies to Java projects and can use either Spoon or JavaParser as the underlying AST engine.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A shared repository of generated transformations could reduce duplicated repair work across the open-source ecosystem.
  • The approach could be tested on breaking changes that affect only a subset of clients to measure how performance drops with non-uniform usage.
  • Integration with continuous integration pipelines might allow automatic detection of breaking updates followed by application of stored transformations.

Load-bearing premise

The agentic LLM process produces correct, executable AST transformations that generalize to new cases without introducing fresh compilation or runtime errors.

What would settle it

Running the same eight configurations on a fresh collection of breaking updates outside the 157 BUMP cases and obtaining fix rates well below 78.6 percent would show the approach does not scale as claimed.

Figures

Figures reproduced from arXiv: 2606.24446 by Benoit Baudry, Frank Reyes, Martin Monperrus.

Figure 1
Figure 1. Figure 1: Overview of the BIGBAG workflow. Step 1: BIGBAG passes three inputs to the agent: the client project with a broken build, the AST engine API documentation, and a transformation template. Step 2: the agent executes a generate-apply-verify loop, generating a fixing transformation, applying it to the client sources, and triggering a Maven build; on failure, the agent revises and retries until the build succee… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of compilation error counts per breaking [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Methodology for RQ3: each effective transformation [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: RQ1: non-compilable fixing transformations per model [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Modern software projects depend on third-party libraries that evolve continuously, introducing breaking API changes that prevent client code from compiling after a dependency update. When the same library update breaks multiple projects, existing repair approaches generate project-specific patches that cannot be reused, requiring each affected project to be repaired independently. We present BigBag, an agentic framework that generates fixing transformations: structured, executable programs that encode the repair logic at the API level and transfer to any client broken by the same update. We evaluate BigBag on 157 compilation failure breaking dependency updates from the BUMP benchmark, across eight configurations combining four large language models and two AST transformation engines (Spoon and JavaParser). The best configuration achieves a compilable transformation rate of 94.3% and a fix rate of 78.6%. Generated transformations transfer across projects, achieving a cross-project fix rate of 33.3% overall and 80% or above for breaking updates where all clients invoke the affected API element uniformly. These results show that agentic generation of reusable fixing transformations is a viable approach to scalable repair of breaking updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents BigBag, an agentic framework that uses LLMs to generate structured, executable AST transformation rules (via Spoon or JavaParser) for repairing breaking dependency updates in Java client code. Evaluated on 157 compilation-failure cases from the BUMP benchmark across eight LLM+engine configurations, the best setup reports a 94.3% rate of compilable transformations and a 78.6% fix rate; generated rules also transfer across projects at 33.3% overall (and ≥80% when all clients invoke the affected API uniformly). The central claim is that this yields a viable, scalable approach to reusable repair of breaking updates.

Significance. If the generated transformations are shown to be semantically correct, the work would demonstrate a practical route to reusable, API-level fixes that avoid per-project patching. The empirical scale (157 cases, multiple models/engines, concrete rates, and cross-project transfer results) and use of an external benchmark are strengths that would support the viability argument.

major comments (2)
  1. [Evaluation] Evaluation section (results on 157 BUMP cases): success is defined exclusively via post-application compilation (94.3% compilable transformations, 78.6% fix rate). No test-suite execution, differential testing, or semantic-equivalence oracle is reported, so the rates do not establish that the transformations preserve original behavior or avoid introducing new runtime faults. This directly affects the interpretation of both the headline fix rate and the cross-project transfer claim.
  2. [Evaluation] Cross-project transfer results (33.3% overall, ≥80% on uniform-invocation subset): the paper does not detail how the subset of cases with uniform API invocation was identified or whether the same transformation was applied without project-specific adaptation. Without this, it is unclear whether the reported transfer rates reflect genuine reusability or selection effects in the benchmark.
minor comments (1)
  1. [Abstract] The abstract and evaluation should explicitly state the exact definition of 'fix rate' (e.g., does it require the original failing compilation to succeed, or merely any compilation success?).

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity on the evaluation methodology and limitations.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (results on 157 BUMP cases): success is defined exclusively via post-application compilation (94.3% compilable transformations, 78.6% fix rate). No test-suite execution, differential testing, or semantic-equivalence oracle is reported, so the rates do not establish that the transformations preserve original behavior or avoid introducing new runtime faults. This directly affects the interpretation of both the headline fix rate and the cross-project transfer claim.

    Authors: We agree that compilation success serves as a proxy metric and does not by itself confirm semantic equivalence or the absence of new runtime issues. The BUMP benchmark centers on compilation failures induced by breaking updates, so our fix rate specifically quantifies restoration of compilability. We have revised the Evaluation section to explicitly define the scope of this metric and added a dedicated Limitations subsection that acknowledges the lack of test-suite or oracle-based validation while outlining plans for future semantic verification work. revision: yes

  2. Referee: [Evaluation] Cross-project transfer results (33.3% overall, ≥80% on uniform-invocation subset): the paper does not detail how the subset of cases with uniform API invocation was identified or whether the same transformation was applied without project-specific adaptation. Without this, it is unclear whether the reported transfer rates reflect genuine reusability or selection effects in the benchmark.

    Authors: The uniform-invocation subset was identified via manual review of client code in the BUMP cases, classifying an update as uniform when every affected client invoked the changed API element identically (same signature and usage pattern). The identical generated transformation rule was then applied to all projects in the subset with no per-project adaptation. We have added a new explanatory paragraph in the Cross-Project Transfer subsection that describes the identification criteria and confirms verbatim rule application. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external benchmark

full rationale

The paper reports experimental results from applying an agentic LLM framework to the independent BUMP benchmark of 157 breaking updates, measuring compilable transformation rate (94.3%) and fix rate (78.6%) plus cross-project transfer. No equations, parameters, derivations, or self-citations are used to derive claims; success metrics are defined directly from observable outcomes on the external test set. The work is self-contained against external benchmarks with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can produce correct reusable AST transformations and that the BUMP benchmark is representative of real breaking updates.

axioms (1)
  • domain assumption Large language models can generate correct and executable AST transformation rules for API-level breaking changes
    The framework depends on LLM capability to output valid, reusable transformations without extensive post-processing.

pith-pipeline@v0.9.1-grok · 5719 in / 1323 out tokens · 31195 ms · 2026-06-25T22:53:42.511335+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 2 canonical work pages

  1. [1]

    An empirical comparison of dependency network evolution in seven software packaging ecosystems,

    A. Decan, T. Mens, and P. Grosjean, “An empirical comparison of dependency network evolution in seven software packaging ecosystems,” Empirical Software Engineering, vol. 24, no. 1, pp. 381–416, Feb

  2. [2]

    Available: https://doi.org/10.1007/s10664-017-9589-y

    [Online]. Available: https://doi.org/10.1007/s10664-017-9589-y

  3. [3]

    Why and how Java developers break APIs,

    A. Brito, L. Xavier, A. Hora, and M. T. Valente, “Why and how Java developers break APIs,” in2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018, pp. 255–265

  4. [4]

    Can automated pull requests encourage software developers to upgrade out-of-date dependencies?

    S. Mirhosseini and C. Parnin, “Can automated pull requests encourage software developers to upgrade out-of-date dependencies?” in2017 32nd IEEE/ACM international conference on automated software engineering (ASE). IEEE, 2017, pp. 84–94

  5. [5]

    How do APIs evolve? A story of refactoring: Research Articles,

    D. Dig and R. Johnson, “How do APIs evolve? A story of refactoring: Research Articles,”J. Softw. Maint. Evol., vol. 18, no. 2, p. 83–107, Mar. 2006

  6. [6]

    Breaking bad? semantic versioning and impact of breaking changes in maven central: An external and differentiated replication study,

    L. Ochoa, T. Degueule, J.-R. Falleri, and J. Vinju, “Breaking bad? semantic versioning and impact of breaking changes in maven central: An external and differentiated replication study,”Empirical Software Engineering, vol. 27, no. 3, p. 61, 2022

  7. [7]

    Byam: Fixing breaking dependency updates with large language models,

    F. Reyes, M. Mahmoud, F. Bono, S. Nadi, B. Baudry, and M. Monper- rus, “Byam: Fixing breaking dependency updates with large language models,”Empirical Software Engineering, vol. 31, no. 4, p. 113, 2026

  8. [8]

    Automatically Fixing Dependency Breaking Changes,

    F. Lukas and K. Jens, “Automatically Fixing Dependency Breaking Changes,”Proceedings of the Conference on the Foundations of Software Engineering, pp. 2146–2168, 2025

  9. [9]

    ReAct: Synergizing Reasoning and Acting in Language Models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing Reasoning and Acting in Language Models,” in Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023). Kigali, Rwanda: OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=WE vluYUL-X

  10. [10]

    BUMP: A Benchmark of Reproducible Breaking Dependency Up- dates,

    F. Reyes, Y . Gamage, G. Skoglund, B. Baudry, and M. Monperrus, “BUMP: A Benchmark of Reproducible Breaking Dependency Up- dates,” in2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2024, pp. 159–170

  11. [11]

    SPELL: Synthesis of Programmatic Edits using LLMs,

    D. Ramos, C. Gamboa, I. Lynce, V . Manquinho, R. Martins, and C. Le Goues, “SPELL: Synthesis of Programmatic Edits using LLMs,”arXiv preprint arXiv:2602.01107, 2026

  12. [12]

    Do developers update their library dependencies? An empirical study on the impact of security advisories on library migration,

    R. G. Kula, D. M. German, A. Ouni, T. Ishio, and K. Inoue, “Do developers update their library dependencies? An empirical study on the impact of security advisories on library migration,”Empirical Software Engineering, vol. 23, no. 1, pp. 384–417, 2018

  13. [13]

    A survey of software refactoring,

    T. Mens and T. Tourwe, “A survey of software refactoring,”IEEE Transactions on Software Engineering, vol. 30, no. 2, pp. 126–139, 2004

  14. [14]

    Source transformation, analysis and generation in TXL,

    J. R. Cordy, “Source transformation, analysis and generation in TXL,” inProceedings of the 2006 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation, ser. PEPM ’06. New York, NY , USA: Association for Computing Machinery, 2006, p. 1–11. [Online]. Available: https://doi.org/10.1145/1111542.1111544

  15. [15]

    Design patterns: Abstraction and reuse of object-oriented design,

    E. Gamma, R. Helm, R. Johnson, and J. Vlissides, “Design patterns: Abstraction and reuse of object-oriented design,” inEuropean conference on object-oriented programming. Springer, 1993, pp. 406–431

  16. [16]

    Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,

    R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier, “Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,”Software: Practice and Experience, vol. 46, pp. 1155–1179, 2015. [Online]. Available: https://hal.archives-ouvertes.fr/ hal-01078532/document

  17. [17]

    Javaparser: visited,

    N. Smith, D. Van Bruggen, and F. Tomassetti, “Javaparser: visited,” Leanpub, oct. de, vol. 10, pp. 29–40, 2017

  18. [18]

    Swe-agent: Agent-computer interfaces enable automated soft- ware engineering,

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated soft- ware engineering,”Advances in Neural Information Processing Systems, vol. 37, pp. 50 528–50 652, 2024

  19. [19]

    Introducing GPT-5.4 mini and nano — OpenAI

    “Introducing GPT-5.4 mini and nano — OpenAI.” [Online]. Available: https://openai.com/index/introducing-gpt-5-4-mini-and-nano/

  20. [20]

    Qwen3 Coder 30B A3B Instruct - API Pricing & Benchmarks — OpenRouter

    “Qwen3 Coder 30B A3B Instruct - API Pricing & Benchmarks — OpenRouter.” [Online]. Available: https://openrouter.ai/qwen/ qwen3-coder-30b-a3b-instruct

  21. [21]

    Deepseek-v3. 2: Pushing the frontier of open large language models,

    A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Donget al., “Deepseek-v3. 2: Pushing the frontier of open large language models,”arXiv preprint arXiv:2512.02556, 2025

  22. [22]

    Gemini 3.1 Pro - Model Card — Google DeepMind

    “Gemini 3.1 Pro - Model Card — Google DeepMind.” [Online]. Avail- able: https://deepmind.google/models/model-cards/gemini-3-1-pro/

  23. [23]

    On randomness in agentic evals,

    B. H. Bjarnason, A. Silva, and M. Monperrus, “On randomness in agentic evals,”arXiv preprint arXiv:2602.07150, 2026

  24. [24]

    Recommending adaptive changes for framework evolution,

    B. Dagenais and M. P. Robillard, “Recommending adaptive changes for framework evolution,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 20, no. 4, pp. 1–35, 2011. 14

  25. [25]

    Breaking-Good: Explaining Breaking Dependency Updates with Build Analysis,

    F. Reyes, B. Baudry, and M. Monperrus, “Breaking-Good: Explaining Breaking Dependency Updates with Build Analysis,” in2024 IEEE International Conference on Source Code Analysis and Manipulation (SCAM), 2024, pp. 36–46

  26. [26]

    MELT: Mining Effective Lightweight Transformations from Pull Requests,

    D. Ramos, H. Mitchell, I. Lynce, V . M. Manquinho, R. Martins, and C. Le Goues, “MELT: Mining Effective Lightweight Transformations from Pull Requests,” in2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2023, pp. 1516–1528

  27. [27]

    FOSSA: Automated dependency review and breaking change analysis,

    “FOSSA: Automated dependency review and breaking change analysis,” https://fossa.com/products/fossabot/, 2025, accessed 2025

  28. [28]

    Patchwork: Automated patch generation and dependency repair,

    “Patchwork: Automated patch generation and dependency repair,” https: //docs.patched.codes/patchwork/overview, 2025, accessed 2025

  29. [29]

    Genprog: A generic method for automatic software repair,

    C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “Genprog: A generic method for automatic software repair,”Ieee transactions on software engineering, vol. 38, no. 1, pp. 54–72, 2011

  30. [30]

    Automated program repair in the era of large pre-trained language models,

    C. S. Xia, Y . Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1482–1494

  31. [31]

    An analysis of the automatic bug fixing performance of chatgpt,

    D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of the automatic bug fixing performance of chatgpt,” in2023 IEEE/ACM International Workshop on Automated Program Repair (APR). IEEE, 2023, pp. 23–30

  32. [32]

    Learning Syntactic Program Transformations from Examples,

    R. R. de Sousa, G. Soares, L. D’Antoni, O. Polozov, S. Gulwani, R. Gheyi, R. Suzuki, and B. Hartmann, “Learning Syntactic Program Transformations from Examples,”2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 404–415, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:11216724

  33. [33]

    Learning to fix build errors with graph2diff neural net- works,

    D. Tarlow, S. Moitra, A. Rice, Z. Chen, P.-A. Manzagol, C. Sutton, and E. Aftandilian, “Learning to fix build errors with graph2diff neural net- works,” inProceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 2020, pp. 19–20

  34. [34]

    Auto-repair without test cases: How LLMs fix compilation errors in large industrial embedded code,

    H. Fu, S. Eldh, K. Wiklund, A. Ermedahl, P. Haller, and C. Artho, “Auto-repair without test cases: How LLMs fix compilation errors in large industrial embedded code,” in2025 28th Euromicro Conference on Digital System Design (DSD), 2025, pp. 97–105

  35. [35]

    Evaluating large language models trained on code,

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374, 2021

  36. [36]

    Language mod- els are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  37. [37]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” 2024. [Online]. Available: https://arxiv. org/abs/2310.06770

  38. [38]

    GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities,

    D. Misra, N. Islah, V . May, B. Rauby, Z. Wang, J. Gehring, A. Orvieto, M. Chaudhary, E. B. Muller, I. Rish, S. Ebrahimi Kahou, and M. Caccia, “GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities,” inProceedings of the 42nd International Conference on Machine Learning (ICML), 2025

  39. [39]

    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration,

    V . May, D. Misra, Y . Luo, A. Sridhar, J. Gehring, and S. Soares Ribeiro Junior, “FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration,”arXiv preprint arXiv:2510.04852, 2025

  40. [40]

    Code Transformation Rule Synthesis Using LLMs: Potential and Limits,

    A. Allain, A. Blot, D. E. Khelladi, and M. Acher, “Code Transformation Rule Synthesis Using LLMs: Potential and Limits,” HAL preprint hal-05607247, 2026. [Online]. Available: https://hal. science/hal-05607247v1

  41. [41]

    Unprecedented Code Change Automation: The Fusion of LLMs and Transformation by Example,

    M. Dilhara, A. Bellur, T. Bryksin, and D. Dig, “Unprecedented Code Change Automation: The Fusion of LLMs and Transformation by Example,”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, 2024

  42. [42]

    Don’t Transform the Code, Code the Trans- forms: Towards Precise Code Rewriting using LLMs,

    C. Cummins, V . Seeker, J. Armengol-Estap ´e, A. H. Markosyan, G. Syn- naeve, and H. Leather, “Don’t Transform the Code, Code the Trans- forms: Towards Precise Code Rewriting using LLMs,” arXiv preprint arXiv:2410.08806, 2024