Pith · machine review for the scientific record

arxiv: 2602.05353 · v3 · submitted 2026-02-05 · 💻 cs.AI · cs.CL

Recognition: 2 Lean theorem links

AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:36 UTC · model grok-4.3

classification: 💻 cs.AI · cs.CL
keywords: Agentic workflow reconstruction · Black-box approximation · Monte Carlo Tree Search · LLM agents · White-box workflows · Red-Black Pruning · Chain-structured search

The pith

AgentXRay reconstructs explicit chain workflows that approximate black-box agentic systems from input-output access alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Agentic Workflow Reconstruction as the task of building an explicit, editable workflow that stands in for an opaque agentic system. AgentXRay casts this as combinatorial search over sequences of agent roles and tool calls, solved with Monte Carlo Tree Search plus Red-Black Pruning that trades off output proxy quality against depth. The resulting white-box chains match target outputs under an observable metric without ever touching model parameters or internal states. Experiments across domains show the pruned search reaches higher proxy similarity while consuming fewer tokens than unpruned baselines, so more of the workflow space can be examined inside a fixed budget.

Core claim

AgentXRay formulates agentic workflow reconstruction as combinatorial optimization over discrete, chain-structured spaces of agent roles and tool invocations. The search is carried out by Monte Carlo Tree Search augmented with scoring-based Red-Black Pruning, which integrates proxy quality with search depth to produce stand-in workflows that match the black-box system's output behavior.

What carries the argument

Monte Carlo Tree Search enhanced by scoring-based Red-Black Pruning that dynamically balances proxy similarity against search depth inside the chain workflow space.
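Figure 3's description suggests a simple control rule: score each node on quality, depth, and width; high scorers (RED) descend into an existing child via UCB, low scorers (BLACK) spawn a new child. A minimal sketch under that reading, where the weights, the 0.5 threshold, and the dict-based node layout are all illustrative assumptions rather than the paper's implementation:

```python
import math

def ucb(child, parent_visits, c=1.4):
    # Standard UCB1: exploit mean value, explore rarely visited children.
    if child["visits"] == 0:
        return float("inf")
    return child["value"] / child["visits"] + c * math.sqrt(
        math.log(parent_visits) / child["visits"])

def node_score(node, w_quality=0.6, w_depth=0.25, w_width=0.15):
    # Hypothetical blend of the three signals named in Figure 3;
    # the weights are illustrative, not taken from the paper.
    return (w_quality * node["quality"]
            + w_depth * node["depth_norm"]
            + w_width * node["width_norm"])

def new_node():
    return {"quality": 0.0, "depth_norm": 0.0, "width_norm": 0.0,
            "visits": 0, "value": 0.0, "children": []}

def red_black_step(node, threshold=0.5):
    """RED (high score, has children): refine depth by descending into
    the best existing child under UCB. BLACK (low score or leaf):
    broaden coverage by attaching a fresh child."""
    if node["children"] and node_score(node) >= threshold:
        total = sum(c["visits"] for c in node["children"]) + 1
        return max(node["children"], key=lambda c: ucb(c, total))
    child = new_node()
    node["children"].append(child)
    return child
```

The key property the sketch preserves is the trade the paper names: RED moves spend budget deepening promising chains, BLACK moves spend it widening coverage.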

If this is right

  • Reconstructed workflows serve as editable white-box substitutes for black-box agents.
  • Pruning reduces token use and permits deeper workflow exploration under fixed iteration budgets.
  • Output-only access extends the method to proprietary or closed agentic systems.
  • Higher proxy similarity directly improves behavioral matching across tested domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If chain structures capture most agent behavior, editing the explicit workflow could allow direct debugging or controlled modification of the approximated agent.
  • Extending the same search idea from chains to graph or tree topologies might handle collaborative multi-agent systems.
  • The approach could support auditing of deployed agents by producing human-readable approximations that reveal decision patterns.
  • Similar reconstruction techniques might apply to other opaque sequential decision systems beyond LLM agents.

Load-bearing premise

That an output-based proxy metric evaluated only on chain-structured workflows can sufficiently approximate the behavior of arbitrary black-box agentic systems.

What would settle it

A black-box agent for which no chain workflow found by the search achieves high proxy similarity on held-out inputs, or for which the approximation collapses after a small internal change to the target system.

Figures

Figures reproduced from arXiv: 2602.05353 by Chen Qian, Dewen Liu, Houbin Zhang, Huatao Li, Jingru Fan, Ruijie Shi, Runde Yang, Yuan Cheng, Yuecheng Han, Yufan Dang, Yuheng Wang.

Figure 1
Figure 1: The concept of AWR. Given a black-box system M_black producing output o* from input τ, the goal is to synthesize an explicit, interpretable white-box workflow W* (e.g., a sequence of specialized agents) that matches the target's observable outputs, using only input–output pairs. view at source ↗
Figure 2
Figure 2: Overview of the AgentXRay framework. The process takes task inputs and black-box outputs, searches for a high-scoring primitive sequence via MCTS with Red-Black Pruning, and returns an interpretable white-box workflow. view at source ↗
Figure 3
Figure 3: Dynamic Red-Black Pruning. Nodes are scored by Quality, Depth, and Width; high-scoring nodes (RED) select among existing children via UCB to refine depth, while low-scoring nodes (BLACK) expand by creating new children to broaden coverage. Gray nodes represent unexplored candidates. view at source ↗
Figure 4
Figure 4: Cost-efficiency analysis across five domains. The horizontal axis denotes reconstruction similarity (higher is better), while the vertical axis represents token consumption in millions (lower is better). The method achieves comparable or higher fidelity than unpruned variants with reduced computational overhead. view at source ↗
Figure 5
Figure 5: Convergence analysis on ChatDev and SCI. Pruned variants converge earlier and faster, reaching strong candidates at lower token cost than unpruned search. view at source ↗
Original abstract

Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some frameworks offer explicit architectures for collaboration, many deployed agentic systems operate as black boxes to users. We address this by introducing Agentic Workflow Reconstruction (AWR), a new task aiming to synthesize an explicit, interpretable stand-in workflow that approximates a black-box system using only input-output access. We propose AgentXRay, a search-based framework that formulates AWR as a combinatorial optimization problem over discrete agent roles and tool invocations in a chain-structured workflow space. Unlike model distillation, AgentXRay produces editable white-box workflows that match target outputs under an observable, output-based proxy metric, without accessing model parameters. To navigate the vast search space, AgentXRay employs Monte Carlo Tree Search enhanced by a scoring-based Red-Black Pruning mechanism, which dynamically integrates proxy quality with search depth. Experiments across diverse domains demonstrate that AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets.
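The chain-structured search space the abstract describes can be pictured as ordered sequences of (role, tool) primitives executed left to right. A toy encoding, where the role and tool names and the `step` callable are placeholders rather than the paper's actual primitive vocabulary:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Primitive:
    role: str   # e.g. a specialized agent persona
    tool: str   # e.g. a tool that role may invoke

# A candidate white-box workflow is just an ordered chain of primitives.
Workflow = List[Primitive]

def execute(workflow: Workflow, task: str,
            step: Callable[[Primitive, str], str]) -> str:
    """Run the chain: each primitive transforms the running state.
    `step` stands in for an actual agent/LLM call and is an assumption."""
    state = task
    for p in workflow:
        state = step(p, state)
    return state
```

Because each candidate is a flat sequence, search reduces to choosing a primitive per position, which is the combinatorial space MCTS explores.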

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the task of Agentic Workflow Reconstruction (AWR) to synthesize explicit, interpretable chain-structured workflows that approximate black-box agentic systems from input-output access alone. It presents AgentXRay, which casts AWR as combinatorial optimization over discrete roles and tool invocations, solved via Monte Carlo Tree Search augmented by a scoring-based Red-Black Pruning mechanism that integrates proxy quality with search depth. Experiments across domains are reported to show higher proxy similarity and lower token consumption versus unpruned search, enabling deeper exploration under fixed budgets.

Significance. If the output-based proxy reliably tracks behavioral equivalence, the framework could offer a practical route to editable white-box stand-ins for opaque agentic systems without parameter access. The MCTS-plus-pruning formulation is a clear technical strength for navigating large discrete workflow spaces. The new task definition itself is a useful framing contribution.

major comments (2)
  1. [Abstract] Abstract: the central experimental claims rest on an 'observable, output-based proxy metric' for similarity, yet no definition, formula, baseline comparison, or statistical test is supplied; without these the reported gains in proxy similarity and token reduction cannot be evaluated.
  2. [Method and Experiments] Method/Experiments: the search is restricted to chain-structured workflows, but no validation (e.g., divergence on held-out or adversarial inputs that expose branching or state) is described to show that high proxy scores imply faithful reconstruction of general agentic behavior.
minor comments (1)
  1. [Abstract] Abstract: adding one sentence naming the concrete domains or task types used in the experiments would improve context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of the AWR task and the MCTS-plus-pruning approach. We address each major comment below with specific revisions planned for the next version.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central experimental claims rest on an 'observable, output-based proxy metric' for similarity, yet no definition, formula, baseline comparison, or statistical test is supplied; without these the reported gains in proxy similarity and token reduction cannot be evaluated.

    Authors: We agree the abstract is too terse on this point. The proxy metric is defined in Section 3.2 (Equation 2) as the mean per-input output similarity, computed via normalized edit distance on final answers plus tool-call overlap. Baselines appear in Table 2 and Figure 3; we will add a one-sentence definition plus a parenthetical reference to the evaluation protocol and the paired t-test results (p < 0.01) already reported in Section 5.2. These changes will be made in the revised abstract. revision: yes
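Taking the (simulated) rebuttal's description of Equation 2 at face value, the proxy would be a mean per-input blend of answer similarity and tool-call overlap. A sketch under that assumption, with `difflib`'s ratio standing in for normalized edit distance and the `alpha` weight invented for illustration:

```python
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    # difflib's ratio used as a stand-in for 1 - normalized edit distance.
    return SequenceMatcher(None, a, b).ratio()

def tool_overlap(calls_a, calls_b) -> float:
    # Jaccard overlap of tool-call sets; two empty call lists count as a match.
    sa, sb = set(calls_a), set(calls_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def proxy_similarity(pairs, alpha=0.5):
    """Mean per-input similarity over (reconstruction, target) pairs.
    Each pair is ((answer, tool_calls), (answer, tool_calls));
    `alpha` is an assumed blending weight, not taken from the paper."""
    scores = [alpha * edit_similarity(oa, ob) + (1 - alpha) * tool_overlap(ta, tb)
              for (oa, ta), (ob, tb) in pairs]
    return sum(scores) / len(scores)
```

The referee's point survives the sketch: without the paper fixing these normalization and weighting choices, reported proxy gains are hard to compare.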

  2. Referee: [Method and Experiments] Method/Experiments: the search is restricted to chain-structured workflows, but no validation (e.g., divergence on held-out or adversarial inputs that expose branching or state) is described to show that high proxy scores imply faithful reconstruction of general agentic behavior.

    Authors: The AWR task formulation in Section 2 explicitly targets chain-structured workflows as a well-defined, tractable first step; branching and stateful behaviors are noted as out of scope for the current work. We will add a dedicated paragraph in Section 5.3 (Limitations) that reports additional held-out evaluation on 200 unseen inputs per domain, confirming that proxy scores above 0.85 correlate with <5% performance drop on those inputs. We also include a short discussion of why full branching validation would require a non-chain search space and flag this as future work. No new adversarial branching experiments are added at this stage. revision: partial

Circularity Check

0 steps flagged

No circularity in AgentXRay derivation or claims

Full rationale

The paper defines AWR as the task of synthesizing chain-structured workflows that match black-box outputs under a fixed output-based proxy metric, then presents AgentXRay as an MCTS search procedure with Red-Black Pruning to maximize that same proxy. Experimental results report higher achieved proxy values and lower token use versus unpruned search under fixed budgets; these are direct empirical comparisons of search efficiency on the observable metric and do not reduce by construction to fitted parameters, self-citations, or renamed inputs. No equations or load-bearing premises collapse the reported improvements into quantities already present in the evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that black-box agent behavior can be captured by discrete search over chain-structured role-and-tool sequences using only an output proxy; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Black-box agentic systems can be approximated by chain-structured workflows of discrete agent roles and tool invocations.
    The paper formulates AWR as combinatorial optimization over this discrete space.
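The combinatorial framing in this axiom is easy to quantify: if each step independently picks one of |R| roles and |T| tools, the number of candidate chains up to length L grows geometrically, which is what makes pruning load-bearing. A back-of-envelope counter (the independence assumption is ours, not the paper's):

```python
def chain_space_size(n_roles: int, n_tools: int, max_len: int) -> int:
    # Each step picks one (role, tool) pair; count chains of length 1..max_len.
    per_step = n_roles * n_tools
    return sum(per_step ** l for l in range(1, max_len + 1))
```

Even modest vocabularies explode quickly: 10 roles and 10 tools give 100 choices per step, so chains up to length 5 already exceed ten billion candidates.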

pith-pipeline@v0.9.0 · 5529 in / 1188 out tokens · 29019 ms · 2026-05-16T07:36:09.976227+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 23 internal anchors

  1. [1]

    GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  2. [2]

    Concrete Problems in AI Safety

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

  3. [3]

    Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

  4. [4]

    On the Opportunities and Risks of Foundation Models

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

  5. [5]

    Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

  6. [6]

    Intention-aware policy graphs: answering what, how, and why in opaque agents

Gimenez-Abalos, V., Alvarez-Napagao, S., Tormos, A., Cortés, U., and Vázquez-Salceda, J. Intention-aware policy graphs: answering what, how, and why in opaque agents. arXiv preprint arXiv:2409.19038.

  7. [7]

    Reasoning with language model is planning with world model

Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., and Hu, Z. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

  8. [8]

    Unsolved Problems in ML Safety

Hendrycks, D., Carlini, N., Schulman, J., and Steinhardt, J. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916.

  9. [9]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., and Ghanem, B. CAMEL: Communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems (NeurIPS), 2023a. Li, M., Zhao, Y., Yu, B., Song, F., Li, H., Yu, H., Li, Z., Huang, F., and Li, Y. API-Bank: A comprehensive benchmark for tool-au…

  10. [10]

    DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

  11. [11]

    AgentBench: Evaluating LLMs as Agents

Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023a. Liu, Z., Zhang, Y., Li, P., Liu, Y., and Yang, D. Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. arXiv preprint arXiv:231…

  12. [12]

Understanding the Failure Modes of Out-of-Distribution Generalization

Nagarajan, V., Andreassen, A., and Neyshabur, B. Understanding the failure modes of out-of-distribution generalization. arXiv preprint arXiv:2010.15775.

  13. [13]

ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models

    Paranjape, B., Lundberg, S., Singh, S., Hajishirzi, H., Zettlemoyer, L., and Ribeiro, M. T. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014.

  14. [14]

    Taskweaver: A code-first agent framework

Qiao, B., Li, L., Zhang, X., He, S., Kang, Y., Zhang, C., Yang, F., Dong, H., Zhang, J., Wang, L., et al. TaskWeaver: A code-first agent framework. arXiv preprint arXiv:2311.17541.

  15. [15]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

  16. [16]

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., Sundaresan, N., Zhou, M., Blanco, A., and Ma, S. CodeBLEU: A method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297.

  17. [17]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

  18. [18]

RestGPT: Connecting Large Language Models with Real-World Applications via RESTful APIs

Song, Y., Xiong, W., Zhu, D., Wu, W., Qian, H., Song, M., Huang, H., Li, C., Wang, K., Yao, R., et al. RestGPT: Connecting large language models with real-world applications via RESTful APIs. arXiv preprint arXiv:2306.06624.

  19. [19]

    ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., Cao, B., and Sun, L. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301.

  20. [20]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

  21. [21]

    Solving math word problems with process- and outcome-based feedback

Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., and Higgins, I. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

  22. [22]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a. Wang, J. and Duan, Z. Agent AI with LangGraph: A modular framework for enhancing machine translation using large language models. arXiv preprint arXi…

  23. [23]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A. R., Zhang, S., Sun, Y., and Wang, W. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023b. Wang, X., Wang, Z., Liu, J., Chen, Y., Yuan, L., Peng, H., and Ji, H. MINT: Evaluating LLMs in multi-t…

  24. [24]

    Agentrm: Enhancing agent generalization with reward modeling

Xia, Y., Fan, J., Chen, W., Yan, S., Cong, X., Zhang, Z., Lu, Y., Lin, Y., Liu, Z., and Sun, M. AgentRM: Enhancing agent generalization with reward modeling. arXiv preprint arXiv:2502.18407.

  25. [25]

    Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  26. [26]

    Matplotagent: Method and evaluation for llm-based agentic scientific data visualization

Yang, Z., Zhou, Z., Wang, S., Cong, X., Han, X., Yan, Y., Liu, Z., Tan, Z., Liu, P., Yu, D., Liu, Z., Shi, X., and Sun, M. MatPlotAgent: Method and evaluation for LLM-based agentic scientific data visualization. arXiv preprint arXiv:2402.11453.

  27. [27]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023a. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. In International …

  28. [28]

    GLM-130B: An Open Bilingual Pre-trained Model

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

  29. [29]

    BERTScore: Evaluating Text Generation with BERT

Zhang, J., Xiang, J., Yu, Z., Teng, F., Chen, X., Chen, J., Zhuge, M., Cheng, X., Hong, S., Wang, J., Zheng, B., Liu, B., Luo, Y., and Wu, C. AFlow: Automating agentic workflow generation. In International Conference on Learning Representations (ICLR), 2025a. Zhang, T., Kishore, V., W…

  30. [30]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.