TreeAgent: A Generalizable Multi-Agent Framework for Automated Bias Labeling in Forestry via Compiled Expert Rules and Vision-Language Models

Collin Hargreaves; Huiqi Wang; Nicholas Saban; Shiyi Chen

arxiv: 2606.31976 · v1 · pith:7UNURDMDnew · submitted 2026-06-30 · 💻 cs.AI · cs.MA

Pith reviewed 2026-07-01 05:08 UTC · model grok-4.3

keywords multi-agent systems · vision-language models · decision trees · forestry · bias labeling · remote sensing · automated annotation

The pith

A multi-agent framework uses expert decision trees and vision-language models to automate bias labeling in forestry remote sensing at lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a multi-agent system that combines expert decision trees with vision-language models to perform automated bias labeling for trees in remote sensing data. It treats the decision trees as fixed structural priors and assigns VLMs to evaluate conditions at each node, using voting among agents to reduce variability. This setup is claimed to reproduce expert procedures without needing to retrain models for new decision structures, cutting down on the time and inconsistency of human annotations. The approach is tested on tree height bias classification and shown to beat standard machine learning methods while keeping the process interpretable through the expert rules.

Core claim

The central contribution is a Decoupled Declarative Decision (D3) Framework that orchestrates multiple agents, each guided by part of an expert decision tree, to label data using VLMs for perception. According to the paper, the system generalizes across expert-defined tree structures without code changes, outperforms supervised baselines, and reduces the expert effort needed for labeling in forestry applications.

What carries the argument

The Decoupled Declarative Decision (D3) Framework, which decouples the expert decision structure from the VLM perception and uses multi-agent voting for robustness.
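
The decoupling can be illustrated with a minimal Python sketch: the tree is plain data, and the perception step is a callable passed in from outside. The node schema (`question`, `yes`, `no`, `label`), the condition names, and the labels are hypothetical and chosen only for illustration; a lookup table stands in for the VLM.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical encoding (not the paper's schema): internal nodes hold a
# question plus "yes"/"no" children, leaves hold a "label".
Node = Dict[str, Any]


def run_tree(tree: Node, perceive: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Walk from the root to a leaf, calling `perceive` at each internal node.

    Returns the leaf label and an audit trail of the questions asked.
    """
    trail: List[str] = []
    node = tree
    while "label" not in node:
        question = node["question"]
        answer = perceive(question)
        trail.append(f"{question} -> {answer}")
        node = node["yes"] if answer else node["no"]
    return node["label"], trail


# Toy tree with made-up conditions and labels.
toy_tree = {
    "question": "crown_occluded",
    "yes": {"label": "biased"},
    "no": {
        "question": "height_mismatch",
        "yes": {"label": "biased"},
        "no": {"label": "unbiased"},
    },
}

# Stand-in for a VLM: canned answers keyed by question.
canned = {"crown_occluded": False, "height_mismatch": True}
label, trail = run_tree(toy_tree, lambda q: canned[q])
print(label, trail)
```

Because `run_tree` never inspects what a question means, swapping in a different tree or a different perception backend requires no change to the traversal code, and the trail keeps each decision traceable to a node in the expert tree.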

If this is right

Reproduces expert-defined labeling procedures at substantially lower annotation cost.
Maintains interpretability by following the expert decision tree structure.
Outperforms supervised ML baselines on the tree bias classification testbed.
Supports zero-modification generalization across diverse expert-defined decision structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method might apply to other fields that rely on decision trees for expert decisions, such as medical imaging or quality control.
Reducing annotation costs could enable larger scale datasets for training remote sensing models in forestry.
Multi-agent voting could be adapted to other stochastic AI systems to improve consistency without additional training.

Load-bearing premise

The expert decision trees capture the complete and accurate logic for the labeling task, and the vision-language models can correctly assess the semantic conditions at each node in the tree.

What would settle it

A direct test would be to apply the framework to a new forestry dataset with a different expert decision tree structure and measure whether the automated labels match independent expert annotations at the same rate as on the original testbed, without any framework modifications.
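
One way to score such a test is to compare automated labels with independent expert labels by raw agreement and by Cohen's kappa, which corrects for chance agreement. The sketch below uses made-up labels; the function names are illustrative, and the paper's own metrics are not given in the text above.

```python
import math
from collections import Counter
from typing import Sequence


def agreement_rate(auto: Sequence[str], expert: Sequence[str]) -> float:
    """Fraction of items on which the two label sequences match."""
    if len(auto) != len(expert) or len(auto) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(a == e for a, e in zip(auto, expert)) / len(auto)


def cohens_kappa(auto: Sequence[str], expert: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(auto)
    p_o = agreement_rate(auto, expert)
    counts_a, counts_e = Counter(auto), Counter(expert)
    # Chance agreement from the two marginal label distributions.
    p_e = sum(counts_a[l] * counts_e[l] for l in set(counts_a) | set(counts_e)) / (n * n)
    if p_e == 1.0:
        return math.nan  # undefined when both raters use one identical label
    return (p_o - p_e) / (1.0 - p_e)


# Made-up example labels.
auto_labels = ["biased", "biased", "unbiased", "unbiased"]
expert_labels = ["biased", "unbiased", "unbiased", "unbiased"]
print(agreement_rate(auto_labels, expert_labels))
print(cohens_kappa(auto_labels, expert_labels))
```

Running the same two functions on the original testbed and on the new dataset would give the like-for-like comparison the test calls for.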

Figures

Figures reproduced from arXiv: 2606.31976 by Collin Hargreaves, Huiqi Wang, Nicholas Saban, Shiyi Chen.

**Figure 1.** Overview of the D3 Framework. The architecture decouples domain-specific logic through a fixed Logic Primitive Inventory (LPI). Top: the Neural Rule Transpiler (NRT) translates an unstructured expert rule ρ into a structured JSON tree configuration T via a single inference call. Bottom: distinct expert strategies (ρA, ρB) are compiled into different executable tree structures without altering the underlyin…
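
To make the idea of a fixed primitive inventory concrete, here is a hedged Python sketch in which two different JSON configurations are loaded by the same function and rejected only if they use a primitive outside the inventory. The primitive names and the `type`/`children` schema are assumptions; the paper's actual LPI and JSON format are not shown in the caption.

```python
import json

# Hypothetical inventory; the real LPI contents are not given in the source.
LPI = {"vlm_check", "threshold", "leaf"}


def load_tree(config_text: str) -> dict:
    """Parse a JSON tree and verify that every node uses a known primitive."""
    tree = json.loads(config_text)
    stack = [tree]
    while stack:
        node = stack.pop()
        if node.get("type") not in LPI:
            raise ValueError(f"unknown primitive: {node.get('type')!r}")
        stack.extend(node.get("children", []))
    return tree


def count_nodes(tree: dict) -> int:
    """Count the nodes in a validated tree."""
    return 1 + sum(count_nodes(child) for child in tree.get("children", []))


# Two made-up strategies with different shapes, loaded by the same code.
config_a = '{"type": "vlm_check", "children": [{"type": "leaf"}, {"type": "leaf"}]}'
config_b = (
    '{"type": "threshold", "children": ['
    '{"type": "leaf"}, '
    '{"type": "vlm_check", "children": [{"type": "leaf"}, {"type": "leaf"}]}]}'
)
tree_a = load_tree(config_a)
tree_b = load_tree(config_b)
print(count_nodes(tree_a), count_nodes(tree_b))
```

The design point is that new expert strategies change only the configuration text, while the loader and the inventory stay fixed.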

**Figure 2.** The VLM agent. Each VLM node receives a node-specific prompt ϑ(v), image modalities, and structured fields. To suppress stochasticity, K=3 independent samples at temperature τ=0.2 are aggregated by majority vote. Two node classes use deterministic fallbacks when the input alone is sufficient.

3.1.5. Configuration Generalizability. The central property of the D3 Framework is zero-modification reconfigurabili…
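
The K=3 majority vote in the Figure 2 caption can be sketched in a few lines of Python. Here `sample_fn` is a hypothetical stand-in for one VLM call at a node; temperature is not modeled, and canned answers replace real model output.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def majority_vote(sample_fn: Callable[[], Hashable], k: int = 3) -> Tuple[Hashable, List[Hashable]]:
    """Draw k samples and return the most common answer plus the raw votes.

    With an odd k and binary answers there is always a strict majority.
    For multi-valued answers, ties resolve to the answer seen first.
    """
    votes = [sample_fn() for _ in range(k)]
    winner, _count = Counter(votes).most_common(1)[0]
    return winner, votes


# Canned answers standing in for three stochastic VLM samples.
answers = iter([True, False, True])
decision, votes = majority_vote(lambda: next(answers), k=3)
print(decision, votes)
```

Keeping the raw votes alongside the decision makes it possible to flag split votes (2 to 1) for expert review, which is one plausible way to direct limited expert effort.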

Abstract

Human-labeled data are widely used as reference annotations in ML, despite known variability across annotators in many expert-driven domains. In addition, expert annotation is slow, inconsistent, and remains a major bottleneck for scaling tasks like tree height bias classification in forestry remote sensing. We propose a multi-agent system (MAS) that orchestrates expert decision trees with Vision-Language Models (VLMs), treating the decision tree as a structural prior while VLMs perform localized semantic perception at individual nodes, with multi-agent voting to mitigate VLM stochasticity. We formalize a Decoupled Declarative Decision (D3) Framework that enables zero-modification generalization across diverse expert-defined decision structures. On a tree bias classification testbed, our framework outperforms supervised ML baselines and reduces the amount of expert labeling effort required. These results suggest that agentic orchestration of VLMs with expert priors can reproduce expert-defined labeling procedures at substantially lower annotation cost while maintaining interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TreeAgent combines expert decision trees with VLMs in a multi-agent setup via the D3 framework, but the empirical claims rest on unshown details.

The paper's main contribution is TreeAgent, a multi-agent system that treats compiled expert decision trees as structural priors to direct VLMs for node-level perception in forestry bias labeling, with voting to stabilize outputs. The Decoupled Declarative Decision (D3) Framework is presented as the mechanism for applying this to varied expert structures without code changes.

The novelty lies in the specific orchestration for domain labeling tasks and in the formalization of D3 for generalization. The paper does a reasonable job of framing the annotation bottleneck in expert fields like remote sensing and of offering an interpretable alternative to pure supervised learning.

The soft spot is the central claim of outperformance over baselines plus reduced expert effort on a testbed. The abstract gives no experimental setup, baselines, metrics, sample sizes, or error analysis, so the results cannot be assessed from the text. The architecture itself looks internally consistent with no circularity or mismatched assumptions.

This is for applied researchers working on agentic systems or scaling annotations in environmental monitoring. A reader looking for concrete frameworks to adapt expert rules to VLMs could extract useful ideas from the description.

It deserves peer review to examine the actual experiments and implementation.

Referee Report

2 major / 1 minor

Summary. The paper proposes TreeAgent, a multi-agent system (MAS) that treats expert decision trees as structural priors, uses VLMs for localized semantic perception at individual nodes, and applies multi-agent voting to mitigate VLM stochasticity. It introduces a Decoupled Declarative Decision (D3) Framework for zero-modification generalization across expert-defined structures. The central empirical claim is that the framework outperforms supervised ML baselines on a tree bias classification testbed in forestry remote sensing while reducing expert labeling effort and preserving interpretability.

Significance. If the results hold, this could meaningfully advance automated labeling in expert domains with high annotation variability and cost, by providing an interpretable alternative to pure supervised learning through the orchestration of symbolic priors and VLMs. The D3 framework's emphasis on generalization without modification and the explicit use of compiled expert rules are strengths that could support broader adoption if validated.

major comments (2)

[Abstract] The claim that the framework 'outperforms supervised ML baselines and reduces the amount of expert labeling effort required' is presented without any reported metrics, baselines, dataset details, error bars, or experimental setup; this absence makes the central empirical contribution unverifiable and is load-bearing for the paper's main assertion.
[Results] (implied by abstract claims) No tables, figures, or quantitative comparisons are referenced that would allow assessment of the outperformance or labeling reduction; without these, the soundness of the testbed evaluation cannot be assessed.

minor comments (1)

[Abstract] The acronyms 'D3' and 'MAS' are introduced without expansion on first use, which reduces immediate clarity for readers unfamiliar with the terms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive comments on the presentation of our empirical results. We address each major comment point by point below.

Referee: [Abstract] The claim that the framework 'outperforms supervised ML baselines and reduces the amount of expert labeling effort required' is presented without any reported metrics, baselines, dataset details, error bars, or experimental setup; this absence makes the central empirical contribution unverifiable and is load-bearing for the paper's main assertion.

Authors: We agree that the abstract presents the central claim without quantitative support, which limits immediate verifiability. The full manuscript reports the experimental setup, dataset, metrics (with error bars), baselines, and labeling effort reduction in the Results section. We will revise the abstract to incorporate key quantitative highlights from those experiments, such as accuracy gains and labeling reduction percentages, to make the claim self-contained and verifiable at the abstract level.

Revision: yes

Referee: [Results] (implied by abstract claims) No tables, figures, or quantitative comparisons are referenced that would allow assessment of the outperformance or labeling reduction; without these, the soundness of the testbed evaluation cannot be assessed.

Authors: The manuscript contains tables and figures presenting the quantitative comparisons, metrics, and labeling effort analysis in the Results section. We acknowledge that the in-text references to these elements may not have been sufficiently explicit or repeated. We will revise the Results section to add clear, repeated citations (e.g., 'as shown in Table 1 and Figure 3') for all quantitative claims, ensuring the supporting evidence is immediately locatable.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents an engineering framework (expert decision trees as structural priors, VLMs for node-level perception, multi-agent voting, and D3 for generalization) whose central claims are supported by empirical results on a forestry testbed rather than by any derivation chain. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described architecture. The method is self-contained against external benchmarks (supervised ML baselines) with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim depends on assumptions about VLM perception capabilities and the effectiveness of expert priors in the MAS; no free parameters or invented entities are specified in the abstract.

axioms (2)

Domain assumption: Vision-Language Models can perform localized semantic perception at individual nodes of expert decision trees.
The framework treats this as the mechanism for perception at decision nodes.
Domain assumption: Multi-agent voting mitigates VLM stochasticity sufficiently for reliable labeling.
Invoked to address variability in model outputs.
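
A short worked calculation shows what the second assumption asks of the model. Suppose, as a simplifying assumption not stated in the paper, that each of K=3 samples at a node is correct independently with probability p. The majority is then correct when at least two samples are:

```latex
% Assumption: the three samples are independent, each correct with probability p.
\[
P_{\text{maj}} = p^{3} + 3p^{2}(1-p) = 3p^{2} - 2p^{3},
\qquad
P_{\text{maj}} - p = p\,(2p-1)(1-p).
\]
% Example: p = 0.9 gives P_maj = 3(0.81) - 2(0.729) = 0.972.
```

Under this model voting helps only when p > 1/2, and it amplifies errors when p < 1/2. Samples drawn from one model on one input are likely correlated, so the gain in practice may be smaller than this idealized figure.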
