Bayesian Uncertainty Propagation for Agentic RAG Pipelines: A Proof-of-Concept Study on Multi-Hop Question Answering

Connor Walker; Koorosh Aslansefat; Louis Donaldson; Yiannis Papadopoulos

arxiv: 2607.00972 · v1 · pith:LVAT434Inew · submitted 2026-07-01 · 💻 cs.AI

Louis Donaldson, Connor Walker, Koorosh Aslansefat, Yiannis Papadopoulos

Pith reviewed 2026-07-02 12:22 UTC · model grok-4.3

classification 💻 cs.AI

keywords: Bayesian uncertainty propagation · Agentic RAG · multi-hop question answering · uncertainty estimation · HotpotQA · StrategyQA · failure detection

The pith

Bayesian networks propagate uncertainty signals through Agentic RAG stages to estimate failure risk in multi-hop question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that uncertainty signals generated at the planner, evaluator, and generator stages of an Agentic RAG system can be combined in a Bayesian Network to produce system-level uncertainty estimates and to flag potential failure points. The approach is tested on StrategyQA and HotpotQA, and it performs better on HotpotQA, where uncertainty builds across reasoning hops. A sympathetic reader would care because reliable uncertainty monitoring is needed to deploy these systems safely on complex tasks. The results show both the promise of the approach and the limits currently imposed by miscalibrated, unreliable upstream signals.

Core claim

Uncertainty signals derived from semantic divergence and generator self-evaluation are propagated through a Bayesian Network to estimate system-level uncertainty and provide node-level indicators of potential failure points across the Agentic RAG workflow for multi-hop question answering.

What carries the argument

Bayesian Network integrating uncertainty signals from semantic divergence and generator self-evaluation at planner, evaluator, and generator stages.
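
As a rough illustration of what propagating stage signals through a chain-structured network involves, the sketch below forward-propagates three stage-level failure probabilities. The chain layout, the function name `propagate_chain`, the reading of each signal as the probability of failing when everything upstream succeeded, and the value 0.9 for failure given an upstream failure are all assumptions made for illustration. The abstract does not state the network structure or its conditional probability tables, so this is not the authors' model.

```python
def propagate_chain(signals, p_fail_given_upstream_fail=0.9):
    """Forward-propagate failure probability along a chain of binary nodes.

    signals: per-stage probabilities of failing when everything upstream
             succeeded, ordered upstream to downstream (assumed semantics).
    Returns the marginal failure probability after each stage.
    """
    p_upstream_fail = 0.0  # nothing precedes the first stage
    marginals = []
    for local_fail in signals:
        # Law of total probability over the state of the upstream node.
        p_fail = (local_fail * (1.0 - p_upstream_fail)
                  + p_fail_given_upstream_fail * p_upstream_fail)
        marginals.append(p_fail)
        p_upstream_fail = p_fail
    return marginals


# Hypothetical signals for planner, evaluator and generator.
stage_names = ["planner", "evaluator", "generator"]
for name, p in zip(stage_names, propagate_chain([0.1, 0.2, 0.3])):
    print(f"{name}: P(failure so far) = {p:.3f}")
```

With these invented inputs the marginal failure probability grows along the chain, which is the kind of accumulation the summary associates with multi-hop questions.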

Load-bearing premise

The uncertainty signals from semantic divergence and generator self-evaluation are reliable enough and independent enough to be valid inputs for the Bayesian network.

What would settle it

An experiment showing that the Bayesian network's uncertainty estimates do not improve prediction of actual system failures compared to using individual stage signals alone on a multi-hop QA dataset.
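
The experiment described above could be scored along the following lines. This is a minimal sketch, assuming AUROC is the comparison metric and that failure labels are available per question. The helper names `auroc` and `propagation_helps` are hypothetical, and the toy numbers in the usage lines are invented, not results from the paper.

```python
def auroc(scores, failed):
    """Probability that a randomly chosen failure gets a higher score than a
    randomly chosen success; ties count as half (Mann-Whitney formulation)."""
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failure and one success")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def propagation_helps(system_scores, stage_scores, failed):
    """True if the system-level score beats every single-stage signal."""
    system_auc = auroc(system_scores, failed)
    best_stage_auc = max(auroc(s, failed) for s in stage_scores.values())
    return system_auc > best_stage_auc


# Invented toy data: 1 marks a question the pipeline answered wrongly.
failed = [1, 0, 1, 0]
system = [0.9, 0.2, 0.7, 0.4]
stages = {"planner": [0.6, 0.5, 0.3, 0.4]}
print(propagation_helps(system, stages, failed))
```

If `propagation_helps` came back false across datasets and models, that would be the negative result the paragraph above describes.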

Figures

Figures reproduced from arXiv: 2607.00972 by Connor Walker, Koorosh Aslansefat, Louis Donaldson, Yiannis Papadopoulos.

Original abstract

Trustworthy deployment of Agentic Retrieval-Augmented Generation (RAG) systems requires mechanisms for estimating when multi-stage reasoning pipelines may fail. This paper presents an uncertainty-aware Agentic Retrieval-Augmented Generation (RAG) framework in which planner, evaluator and generator stages produce uncertainty signals derived from semantic divergence and generator self-evaluation. These signals are propagated through a Bayesian Network (BN) to estimate system-level uncertainty and provide node-level indicators of potential failure points across the workflow. The approach is evaluated on StrategyQA and HotpotQA using GPT-3.5-Turbo and GPT-4.1-Nano, with Area Under the Receiver Operating Characteristic Curve (AUROC), Area Under the Accuracy-Rejection Curve (AUARC), Expected Calibration Error (ECE), and Brier Score used to assess discrimination, selective prediction and calibration. Results show that Bayesian propagation is more effective on HotpotQA, where uncertainty accumulates across multi-hop reasoning stages, while StrategyQA exposes limitations caused by miscalibration and unreliable upstream signals. The study positions Bayesian uncertainty propagation as a promising but preliminary mechanism for monitoring Agentic RAG systems, with future validation required in industrial domains such as Offshore Wind (OSW) maintenance decision support.
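
For readers less familiar with the two calibration metrics named in the abstract, the sketch below computes the Brier Score and a binned Expected Calibration Error for predicted failure probabilities. The use of ten equal-width bins is an assumption, since the abstract does not describe the binning scheme, and the function names are illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean prediction|.

    probs are assumed to lie in [0, 1]; outcomes are 0 or 1.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(observed - mean_pred)
    return ece
```

Both functions return 0 for perfectly confident and correct predictions; larger values indicate worse calibration.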

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Bayesian networks are applied to agentic RAG uncertainty, but the input signals from semantic divergence and self-evaluation have not been shown to be calibrated or independent.

The main takeaway is that Bayesian propagation of uncertainty signals through planner-evaluator-generator stages performs better on HotpotQA than on StrategyQA, where miscalibration in the upstream signals limits the gains.

The paper applies an established technique—Bayesian networks for combining uncertainty across pipeline stages—to agentic RAG. It generates node signals via semantic divergence and generator self-evaluation, feeds them into the BN, and reports system-level uncertainty plus node-level failure indicators. Experiments use GPT-3.5-Turbo and GPT-4.1-Nano on StrategyQA and HotpotQA, with AUROC, AUARC, ECE, and Brier score to measure discrimination, selective prediction, and calibration.
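
Of the four metrics, AUARC is the least standard, so a small sketch of one common discretization may help: answers are ranked from most to least certain, accuracy is computed for every retained prefix, and the accuracies are averaged. Whether the paper uses this exact discretization is not stated, so treat it as an assumption. The name `auarc` is illustrative.

```python
def auarc(uncertainty, correct):
    """Area under the accuracy-rejection curve, approximated as the mean
    accuracy over all retained-set sizes, keeping the most certain answers first.

    uncertainty: one score per question, higher means less certain.
    correct: 1 if the answer was right, 0 otherwise.
    """
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    hits = 0
    accuracies = []
    for kept, i in enumerate(order, start=1):
        hits += correct[i]
        accuracies.append(hits / kept)
    return sum(accuracies) / len(accuracies)
```

An uncertainty score that ranks wrong answers last yields a higher value than one that ranks them first, which is what makes the metric a measure of selective prediction.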

It does a reasonable job of framing a practical monitoring approach and running the evaluation on two distinct QA tasks. The contrast between the datasets usefully shows where accumulated multi-hop uncertainty benefits from propagation.

The soft spots sit in the input assumptions. The signals are treated as valid probability estimates for the BN, yet the abstract itself flags unreliable upstream signals and miscalibration on StrategyQA. No per-node calibration checks, correlation with actual failure labels, or ablation of BN versus raw signals appear in the provided description, so the added value of the propagation step rests on an unverified modeling choice rather than direct evidence. Structure and parameter choices for the BN also receive little detail.
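
To make concrete what parameter detail for the BN might look like, the sketch below estimates one conditional probability table from labeled held-out runs. Everything here is hypothetical: the record format, the function name `estimate_cpt`, and the add-alpha (Laplace) smoothing are illustrative choices, not something the abstract describes.

```python
from collections import Counter


def estimate_cpt(records, alpha=1.0):
    """Estimate P(child fails | parent state) from observed pairs.

    records: iterable of (parent_failed, child_failed) booleans taken from
             labeled held-out runs (assumed to be available).
    alpha:   add-alpha smoothing so unseen parent states do not divide by zero.
    Returns a dict mapping parent_failed -> P(child fails | parent_failed).
    """
    counts = Counter(records)
    cpt = {}
    for parent_failed in (False, True):
        fails = counts[(parent_failed, True)]
        total = fails + counts[(parent_failed, False)]
        cpt[parent_failed] = (fails + alpha) / (total + 2 * alpha)
    return cpt


# Invented counts: 10 runs with a healthy parent stage, 4 with a failed one.
runs = ([(False, False)] * 8 + [(False, True)] * 2
        + [(True, True)] * 3 + [(True, False)] * 1)
print(estimate_cpt(runs))
```

Reporting tables like this, together with the size of the held-out set, would let a reader judge how much of the network rests on data and how much on judgment.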

This is for researchers focused on trustworthy agentic systems who need concrete ways to track failure points in RAG pipelines. A reader looking for an initial implementation idea in this area would get some value from the setup and metrics, though the work is explicitly preliminary.

It deserves peer review. The experiments are on real benchmarks and the problem matters for deployment, even if the authors would need to add signal validation to make the central claim stronger.

Referee Report

3 major / 0 minor

Summary. The paper presents a proof-of-concept uncertainty-aware Agentic RAG framework in which planner, evaluator, and generator stages emit uncertainty signals derived from semantic divergence and generator self-evaluation; these signals are fed as inputs to a Bayesian network that produces system-level uncertainty estimates and node-level failure indicators. The framework is evaluated on StrategyQA and HotpotQA using GPT-3.5-Turbo and GPT-4.1-Nano, with performance measured by AUROC, AUARC, ECE, and Brier score. Results indicate that Bayesian propagation performs better on HotpotQA (where uncertainty accumulates across hops) than on StrategyQA (where upstream miscalibration limits effectiveness). The work positions the method as preliminary and suggests future validation in domains such as offshore wind maintenance.

Significance. If the input signals can be shown to be calibrated and independent, the approach supplies a structured, interpretable mechanism for tracing uncertainty through multi-stage agentic pipelines—an area of growing practical importance. The explicit acknowledgment of dataset-dependent limitations and the use of standard selective-prediction and calibration metrics are positive; however, the absence of any derivation of the BN parameters, per-node calibration diagnostics, or ablation isolating the propagation step keeps the contribution at the level of an interesting but unverified modeling idea rather than a substantiated technique.

major comments (3)

[Abstract] Abstract, paragraph describing the framework: the central modeling assumption—that semantic-divergence and self-evaluation signals constitute reliable, independent, calibrated probability estimates suitable as BN inputs—is load-bearing for the claim that propagation improves system-level uncertainty estimates, yet the abstract itself reports that StrategyQA exposes miscalibration and unreliable upstream signals; no per-node ECE, correlation with failure labels, or independence checks are described to support this assumption.
[Abstract] Abstract, evaluation paragraph: no description is given of how the BN structure or conditional probability tables were elicited or learned, nor are error bars, confidence intervals, or statistical significance tests reported for the metric differences between HotpotQA and StrategyQA; without these, the comparative claim that propagation is “more effective” on HotpotQA cannot be assessed for robustness.
[Abstract] Abstract, results sentence: the conclusion that Bayesian propagation is more effective where uncertainty accumulates rests on the unverified premise that the supplied node signals are valid; an ablation comparing BN outputs against raw signals or against a simple product/average baseline is required to isolate any benefit of the propagation step itself.
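
The baselines requested in the third comment are cheap to compute. The sketch below gives one reading of them: the average of the stage signals, and a product rule that treats stage failures as independent (one minus the product of per-stage success probabilities). That interpretation of "product", and the function names, are assumptions. The raw generator signal is included as the simplest comparison.

```python
def average_baseline(signals):
    """Mean of the stage-level failure probabilities."""
    return sum(signals) / len(signals)


def product_baseline(signals):
    """Failure probability if stages fail independently:
    1 minus the product of per-stage success probabilities."""
    p_all_succeed = 1.0
    for s in signals:
        p_all_succeed *= (1.0 - s)
    return 1.0 - p_all_succeed


def baseline_scores(per_question_signals):
    """Score every question with each baseline.

    per_question_signals: list of [planner, evaluator, generator] signals.
    """
    return {
        "average": [average_baseline(s) for s in per_question_signals],
        "product": [product_baseline(s) for s in per_question_signals],
        "raw_generator": [s[-1] for s in per_question_signals],
    }
```

Each list of scores can then be evaluated with the same metrics as the BN output, so any gap between them reflects the propagation step rather than the signals themselves.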

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our proof-of-concept study. The comments correctly identify areas where additional methodological detail and controls would strengthen the presentation. We address each major comment below and commit to revisions that improve transparency and verifiability while preserving the preliminary nature of the work.

Referee: [Abstract] Abstract, paragraph describing the framework: the central modeling assumption—that semantic-divergence and self-evaluation signals constitute reliable, independent, calibrated probability estimates suitable as BN inputs—is load-bearing for the claim that propagation improves system-level uncertainty estimates, yet the abstract itself reports that StrategyQA exposes miscalibration and unreliable upstream signals; no per-node ECE, correlation with failure labels, or independence checks are described to support this assumption.

Authors: We agree that the assumption is central and that the abstract deliberately surfaces the StrategyQA limitations to illustrate when upstream signals undermine propagation. The current text does not include per-node diagnostics. In revision we will add per-node ECE values, Pearson correlations between node signals and failure labels, and pairwise independence checks among the three input signals. These additions will be placed in a new subsection of the methods and referenced from the abstract. revision: yes
Referee: [Abstract] Abstract, evaluation paragraph: no description is given of how the BN structure or conditional probability tables were elicited or learned, nor are error bars, confidence intervals, or statistical significance tests reported for the metric differences between HotpotQA and StrategyQA; without these, the comparative claim that propagation is “more effective” on HotpotQA cannot be assessed for robustness.

Authors: The referee is correct that the abstract (and current methods) omit these details. The BN structure follows the explicit pipeline stages (planner → evaluator → generator) and the CPTs were populated from a combination of empirical frequencies on a small held-out set and expert judgment on conditional failure probabilities. In the revision we will (i) describe this elicitation process explicitly, (ii) report bootstrap confidence intervals for all reported metrics, and (iii) add paired statistical tests (Wilcoxon signed-rank) on the AUROC and AUARC differences between the two datasets. revision: yes
Referee: [Abstract] Abstract, results sentence: the conclusion that Bayesian propagation is more effective where uncertainty accumulates rests on the unverified premise that the supplied node signals are valid; an ablation comparing BN outputs against raw signals or against a simple product/average baseline is required to isolate any benefit of the propagation step itself.

Authors: We accept that an ablation is necessary to isolate the contribution of the propagation step. The current manuscript does not contain such a comparison. In the revised version we will add a results subsection that reports system-level AUROC/AUARC when (a) raw node signals are used directly, (b) a simple product or average aggregation is applied, and (c) the full BN is used. This will allow readers to quantify the incremental value of the Bayesian step. revision: yes
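
Two of the diagnostics promised in the first two responses, the correlation between a node signal and failure labels and bootstrap confidence intervals for a metric, can be sketched with the standard library alone. The function names, the percentile method, the default of 1000 resamples, and the fixed seed are assumptions for illustration; the responses do not specify these details.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; with 0/1 labels this is the point-biserial value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def bootstrap_ci(metric, scores, labels, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for metric(scores, labels).

    The metric must be defined on every resample; metrics that need both
    classes present can fail on small or heavily imbalanced data.
    """
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(metric([scores[i] for i in idx],
                                [labels[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 + level) / 2 * n_resamples))]
    return lower, upper
```

Applying `pearson` to each pair of node signals gives a first, linear-only check on the independence assumption; a value near zero does not by itself establish independence.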

Circularity Check

0 steps flagged

No significant circularity; framework propagates external signals and evaluates empirically

The paper describes an empirical proof-of-concept framework in which uncertainty signals (from semantic divergence and generator self-evaluation) serve as inputs to a Bayesian network for propagation and system-level estimation. No equations, fitted parameters, or derivations are presented that reduce outputs to inputs by construction. Evaluation relies on standard external benchmarks (HotpotQA, StrategyQA) and metrics (AUROC, AUARC, ECE, Brier Score) without renaming known results or invoking self-citation chains for uniqueness. The central claims rest on observed performance differences across datasets rather than self-referential modeling assumptions, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields limited visibility into parameters or axioms; the framework implicitly treats stage-level uncertainty signals as independent inputs whose joint distribution can be modeled by a BN.

axioms (1)

Domain assumption: uncertainty signals derived from semantic divergence and generator self-evaluation are valid and sufficiently calibrated inputs for the Bayesian network. Stated in the framework description in the abstract.
