arxiv: 2605.25030 · v1 · pith:MRWVZ6PGnew · submitted 2026-05-24 · 💻 cs.LG

MimirRAG: A Multi-Agent RAG Framework for Financial Data Retrieval with Metadata Integration

Pith reviewed 2026-06-30 12:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords retrieval-augmented generation · multi-agent systems · financial data retrieval · metadata integration · table-aware chunking · FinanceBench · large language models · agentic workflow

The pith

A multi-agent RAG system with metadata integration and table-aware chunking reaches 89.3 percent accuracy on FinanceBench financial questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MimirRAG as a system that combines multiple agents with metadata from parsed financial filings to improve retrieval-augmented generation for mixed PDF documents. It argues that typical RAG approaches fall short when documents contain tables and require precise numerical grounding rather than model-generated answers. The work shows that adding structure-preserving parsing, table-aware chunking, metadata extraction, agent-driven query planning, and hybrid search produces higher accuracy. Evaluation used both a standard benchmark and feedback from financial analysts. If the approach holds, it would allow more reliable extraction of evidence-based insights directly from company filings.

Core claim

MimirRAG features a modular pipeline that performs structure-preserving parsing of PDF filings, table-aware chunking, metadata extraction, agent-based retrieval with query planning and hybrid search, validation, and context-aware generation with numerical reasoning support. The system reached 89.3 percent accuracy on FinanceBench, outperforming the original benchmark baselines. An ablation study identified metadata integration, table-aware chunking, and an agentic workflow as the three key technical enablers. Qualitative review with four financial analysts added that successful use also requires calibrated trust, comprehensive data integration, and user personalization.

What carries the argument

The multi-agent retrieval workflow that plans queries, performs hybrid search, and fuses results with metadata extracted during parsing of financial documents.
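
A minimal sketch of what metadata-filtered hybrid search could look like, assuming a toy in-memory store. The `Chunk` type, the term-count keyword score, and reciprocal rank fusion (with the commonly used constant 60) are illustrative assumptions; this page does not state which scoring or fusion method MimirRAG uses.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk record: text, a dense vector, and filing metadata."""
    chunk_id: str
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query_terms, text):
    """Count how often the query terms occur in the text (a stand-in for BM25)."""
    counts = Counter(text.lower().split())
    return sum(counts[term.lower()] for term in query_terms)


def hybrid_search(chunks, query_terms, query_embedding, filters, top_k=3, rrf_k=60):
    # 1. Metadata pre-filter: keep only chunks whose metadata matches every filter.
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    # 2. Rank the surviving chunks twice: by vector similarity and by keywords.
    by_vector = sorted(
        candidates, key=lambda c: cosine(query_embedding, c.embedding), reverse=True
    )
    by_keyword = sorted(
        candidates, key=lambda c: keyword_score(query_terms, c.text), reverse=True
    )
    # 3. Fuse the two rankings with reciprocal rank fusion (assumed, not from the paper).
    fused = Counter()
    for ranking in (by_vector, by_keyword):
        for rank, chunk in enumerate(ranking, start=1):
            fused[chunk.chunk_id] += 1.0 / (rrf_k + rank)
    ordered = sorted(candidates, key=lambda c: fused[c.chunk_id], reverse=True)
    return ordered[:top_k]
```

The pre-filter runs before any scoring, so a chunk from the wrong company or fiscal year cannot reach the top-K no matter how similar its text is.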

If this is right

  • Metadata attached to document sections improves retrieval precision when filings mix text and tables.
  • Table-aware chunking preserves numerical relationships that standard splitting methods break.
  • Agent-based query planning handles questions that span multiple sections of a filing more effectively than single-pass retrieval.
  • Validation steps before generation reduce unsupported numerical claims in financial answers.
  • Human-centric elements such as personalization become necessary for practical deployment alongside the technical pipeline.
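
The table-aware chunking point above can be illustrated with a short sketch. It assumes the parser has already produced a list of text and table blocks; the `Block` type, the word and row budgets, and the pipe-separated rendering are hypothetical choices, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical parser output: kind is "text" or "table"."""
    kind: str
    content: object  # str for text, list of rows (lists of str) for tables


def render_rows(header, rows):
    """Render a header plus rows as pipe-separated lines."""
    lines = [" | ".join(header)] + [" | ".join(row) for row in rows]
    return "\n".join(lines)


def chunk_blocks(blocks, max_words=50, max_table_rows=20):
    chunks = []
    for block in blocks:
        if block.kind == "table":
            # Split tables only on row boundaries and repeat the header in
            # every piece, so each number stays attached to its column label.
            header, *body = block.content
            for start in range(0, max(len(body), 1), max_table_rows):
                chunks.append(render_rows(header, body[start:start + max_table_rows]))
        else:
            # Plain text is split on a simple word budget.
            words = block.content.split()
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

A fixed-size splitter applied to the same input could cut a row in half or separate figures from their header, which is the failure mode this bullet describes.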

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same combination of metadata and agent planning could transfer to other regulated document domains that require evidence grounding.
  • Adding explicit validation layers may lower hallucination rates in any high-stakes retrieval setting.
  • Over time the modular design could support incremental updates when new filings arrive without full retraining.
  • Testing on live market events would show whether the current accuracy holds when questions involve time-sensitive data not present in static benchmarks.

Load-bearing premise

That results on the FinanceBench dataset together with feedback from four financial analysts are enough to show the approach works in real financial analysis tasks.

What would settle it

A new test set of financial questions drawn from recent filings outside FinanceBench on which MimirRAG accuracy falls to or below standard RAG baselines.

Figures

Figures reproduced from arXiv: 2605.25030 by Magnus Samuelsen, Mansoor Hussain, Mikkel Strange, Somnath Mazumdar, Wilmer Nyström.

Figure 1: Iterative BIE process in MimirRAG development.
Figure 2: MimirRAG pipeline: pre-retrieval, retrieval, and post-retrieval stages.
Figure 3: Pre-retrieval data pipeline representation
Figure 4: Entity-relationship (ER) diagram of vector database
Figure 5: Hybrid Search Process: A query is first augmented with keywords and metadata filters. Relevant information is retrieved and used as pre-filters at the chunk level. A hybrid search is done over all relevant chunks and restricted to top-K …
Figure 6: illustrates this dynamic interplay, showcasing how agents exchange structured messages. This architecture supports concurrent execution (parallel searches and validations) and Pydantic AI orchestration logic that handles fallbacks and retries. Data integrity is paramount, thus all agent outputs are validated against their Pydantic models. This strict validation, coupled with automatic retries (three by def…
Figure 7: Sequence diagram illustrates the interaction between agents in the hierarchical retrieval and validation pipeline. Green arrows tion and guidance an agent would need to perform its task effectively, i.e., ‘thinking like the agent’. The overall prompt
Figure 9: High-level Azure deployment diagram for MimirRAG.
Figure 10: Screenshots of MimirRAG prototype: home page (top), query interface (middle) and chat interface (bottom)
Figure 11: Output example comparing Naive RAG (left) with MimirRAG (right, using GPT-4.1-mini) for a question on Verizon’s FY 2022
Figure 12: Iteration 1: Average chunk similarity (Cosine and Levenshtein).
Figure 13: Iteration 1: Answer outcomes by reasoning type for GPT-4.1 (10).
Figure 14: Iteration 2: Average chunk similarity (Cosine and Levenshtein).
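
The Figure 6 caption describes agent outputs being validated against Pydantic models, with automatic retries. The sketch below shows the validate-then-retry pattern using only the standard library as a stand-in for Pydantic. The `RetrievalResult` schema, its field names, and the `agent_call` signature are hypothetical, and the default of three retries follows the truncated caption text rather than a confirmed setting.

```python
from dataclasses import dataclass


@dataclass
class RetrievalResult:
    """Hypothetical schema for one agent's structured output."""
    chunk_ids: list
    confidence: float


class ValidationError(ValueError):
    """Raised when an agent output does not match the expected schema."""


def validate_result(raw):
    """Check a raw agent output and convert it to a RetrievalResult."""
    if not isinstance(raw, dict):
        raise ValidationError("output must be a mapping")
    chunk_ids = raw.get("chunk_ids")
    confidence = raw.get("confidence")
    if not isinstance(chunk_ids, list) or not all(isinstance(c, str) for c in chunk_ids):
        raise ValidationError("chunk_ids must be a list of strings")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValidationError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValidationError("confidence must lie in [0, 1]")
    return RetrievalResult(chunk_ids, float(confidence))


def run_with_retries(agent_call, max_retries=3):
    """Call the agent, validate its output, and retry on schema violations.

    agent_call receives the attempt index and the previous validation error
    (or None), so a real agent could feed the error back into its prompt.
    """
    last_error = None
    for attempt in range(1 + max_retries):
        try:
            return validate_result(agent_call(attempt, last_error))
        except ValidationError as err:
            last_error = err
    raise RuntimeError(
        f"agent output still invalid after {max_retries} retries"
    ) from last_error
```

Passing the previous error back to the agent is one possible design; it lets a retry correct the specific field that failed instead of repeating the same request blindly.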
read the original abstract

Retrieval-augmented generation (RAG) systems offer a promising approach to reduce hallucinations and improve answer accuracy in large language models (LLMs), a requirement for reliable financial analysis where answers must be grounded in verifiable evidence from filings rather than generated from model priors. However, designing RAG systems that extract meaningful insights from mixed financial documents and integrate into analyst workflows remains challenging. This paper introduces MimirRAG (Metadata-Integrated Multi-Agent Information Retrieval), a multi-agent RAG system developed iteratively to address these challenges. MimirRAG features a modular pipeline encompassing structure-preserving parsing of PDF filings, table-aware chunking, metadata extraction, agent-based retrieval with query planning and hybrid search, validation, and context-aware generation with numerical reasoning support. Our ablation study identifies three key technical enablers for effective financial RAG: metadata integration, table-aware chunking, and an agentic workflow. MimirRAG was evaluated quantitatively using FinanceBench and qualitatively through expert validation with four financial analysts. The system achieved 89.3% accuracy on FinanceBench, outperforming the original benchmark baselines. Expert feedback highlighted that successful deployment also requires calibrated trust, comprehensive data integration, and user personalization. We conclude that combining multi-agent RAG architecture with human-centric design principles can improve the extraction of meaningful insights in financial analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces MimirRAG, a multi-agent RAG system for financial document retrieval featuring structure-preserving PDF parsing, table-aware chunking, metadata extraction, agent-based retrieval with query planning and hybrid search, validation, and context-aware generation. It reports 89.3% accuracy on FinanceBench (outperforming original baselines), identifies metadata integration, table-aware chunking, and agentic workflow as key enablers via ablation, and includes qualitative expert validation with four financial analysts.

Significance. If the performance and ablation claims are substantiated with more rigorous controls and testing, the work could provide practical value for domain-specific RAG systems in finance by highlighting the role of metadata and multi-agent workflows in handling mixed text/table documents. The modular pipeline offers a concrete implementation example, though the current evaluation limits broader claims of superiority or generalization.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The central claim of 89.3% accuracy on FinanceBench is reported without error bars, details on data splits or train/test methodology, baseline implementation specifics, or statistical significance testing on the performance improvement; this directly affects the ability to assess whether the system outperforms baselines in a robust manner.
  2. [Ablation study] Ablation study: The identification of the three technical enablers (metadata integration, table-aware chunking, agentic workflow) lacks quantitative isolation of their individual effects, controls for confounding variables such as prompt variations or model choice, and explicit comparison metrics, making the attribution of performance gains load-bearing but under-supported.
  3. [Expert validation] Expert validation: Reliance on feedback from only four financial analysts without quantitative metrics, sampling details, or controls for workflow variability provides insufficient evidence for claims of real-world effectiveness and generalization beyond the FinanceBench distribution.
minor comments (1)
  1. [Abstract] The abstract refers to 'original benchmark baselines' without naming them or providing citations, which reduces clarity on the comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the rigor of our evaluation and analysis. We address each major comment below and indicate the planned revisions.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The central claim of 89.3% accuracy on FinanceBench is reported without error bars, details on data splits or train/test methodology, baseline implementation specifics, or statistical significance testing on the performance improvement; this directly affects the ability to assess whether the system outperforms baselines in a robust manner.

    Authors: We agree that the current reporting lacks sufficient statistical detail. In the revised manuscript we will add error bars from multiple runs with varied seeds, clarify that FinanceBench is a fixed benchmark (no custom train/test split; we evaluate on the provided test set), provide exact baseline implementation details including models and prompts, and report statistical significance tests (e.g., paired t-tests) for the observed improvements. These changes will appear in the Evaluation section and be referenced in the abstract. revision: yes

  2. Referee: [Ablation study] Ablation study: The identification of the three technical enablers (metadata integration, table-aware chunking, agentic workflow) lacks quantitative isolation of their individual effects, controls for confounding variables such as prompt variations or model choice, and explicit comparison metrics, making the attribution of performance gains load-bearing but under-supported.

    Authors: We will strengthen the ablation study by presenting incremental component additions with a dedicated metrics table, fixing prompts across variants to control for prompt effects, and explicitly noting that the same LLM backbone was used throughout to control for model choice. While exhaustive isolation of every possible confounder would require additional experiments beyond the current scope, these revisions will provide clearer quantitative attribution for the three enablers. revision: partial

  3. Referee: [Expert validation] Expert validation: Reliance on feedback from only four financial analysts without quantitative metrics, sampling details, or controls for workflow variability provides insufficient evidence for claims of real-world effectiveness and generalization beyond the FinanceBench distribution.

    Authors: We accept that the expert validation is limited in scale and detail. The revision will expand the section to describe analyst selection criteria, the interview protocol, and any quantitative elements present in the feedback. We will also add an explicit limitations paragraph noting the small sample size and positioning the validation as supplementary qualitative insight rather than primary evidence of generalization. revision: yes
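
The error bars and significance testing discussed in the first exchange could be computed along the following lines. This sketch swaps in a paired percentile bootstrap over per-question outcomes in place of the paired t-tests the rebuttal mentions, since correctness is binary. The function names, resample count, and seed are assumptions, and any outcome lists passed in would be the evaluator's own data.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_difference_ci(system, baseline, **kwargs):
    """Interval for the accuracy gap, resampling questions rather than runs.

    `system` and `baseline` hold 1 (correct) or 0 (incorrect) for the same
    questions in the same order, so each difference is paired per question.
    """
    if len(system) != len(baseline):
        raise ValueError("both systems must be scored on the same questions")
    diffs = [s - b for s, b in zip(system, baseline)]
    return bootstrap_ci(diffs, **kwargs)
```

If the resulting interval for the accuracy gap excludes zero, the improvement is unlikely to be an artifact of which questions happened to be in the benchmark; an interval that straddles zero would support the referee's concern.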

Circularity Check

0 steps flagged

No circularity: empirical system evaluation on external benchmark

full rationale

The paper describes construction and evaluation of a multi-agent RAG pipeline for financial documents. Its central results are accuracy numbers obtained by running the system on the external FinanceBench benchmark and qualitative feedback from four analysts. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Ablation studies are standard removal experiments whose outcomes are measured against the same external benchmark rather than being forced by construction. The work is therefore self-contained against external data and does not reduce any claimed result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the paper is an engineering description of a retrieval pipeline rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5787 in / 1068 out tokens · 26521 ms · 2026-06-30T12:24:39.578202+00:00 · methodology
