arxiv: 2602.12735 · v2 · submitted 2026-02-13 · 💻 cs.CV · cs.CL

Recognition: 3 theorem links

· Lean Theorem

VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

Qiuchen Wang , Shihang Wang , Yu Zeng , Qiang Zhang , Fanrui Zhang , Zhuoning Guo , Bosi Zhang , Wenxuan Huang

show 4 more authors

Lin Chen Zehui Chen Pengjun Xie Ruixue Ding

Authors on Pith no claims yet

Pith reviewed 2026-05-15 22:45 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords multimodal RAGvisual memory graphdirected acyclic graphretrieval augmented generationagent reasoningpolicy optimizationmultimodal evidence

0 comments

The pith

VimRAG structures multimodal reasoning as a dynamic directed acyclic graph to prioritize pivotal visual evidence in retrieval-augmented generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

VimRAG models the iterative reasoning process in multimodal retrieval-augmented generation as a dynamic directed acyclic graph of agent states and retrieved evidence from text, images, and videos. This structure allows the system to assess the importance of each memory node based on its topological position within the graph. The Graph-Modulated Visual Memory Encoding then allocates higher resolution tokens to nodes deemed pivotal while compressing others. A Graph-Guided Policy Optimization strategy improves learning by disentangling step-wise validity from overall trajectory rewards through pruning of redundant nodes. This leads to more effective handling of massive visual contexts compared to traditional linear interaction histories.

Core claim

VimRAG establishes that representing the reasoning trajectory as a dynamic directed acyclic graph enables precise evaluation of memory node significance via topological position, which in turn supports dynamic allocation of high-resolution tokens to critical visual evidence and facilitates fine-grained credit assignment through graph-guided policy optimization.

What carries the argument

A dynamic directed acyclic graph that structures agent states and multimodal evidence, with topological positions determining the significance for visual memory encoding and policy optimization.

If this is right

Improved performance on diverse multimodal RAG benchmarks involving long visual contexts.
More efficient use of tokens by prioritizing pivotal evidence over trivial clues.
Enhanced ability to handle iterative reasoning scenarios with information-sparse visual data.
Disentangled step-wise validity assessment leading to better overall trajectory rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This graph-based approach may extend to other agentic systems dealing with high-dimensional data streams.
Potential for combining with existing RAG techniques to further reduce computational overhead in visual reasoning.
Could inform designs for memory management in long-context multimodal models beyond RAG.

Load-bearing premise

That the topological position of nodes in the dynamic DAG accurately reflects the importance of the associated multimodal evidence for successful reasoning.

What would settle it

Demonstrating that a version of VimRAG without the topological modulation in memory encoding achieves equivalent performance on the benchmarks, or identifying a task where linear history methods consistently outperform the graph-structured approach.

Figures

Figures reproduced from arXiv: 2602.12735 by Bosi Zhang, Fanrui Zhang, Lin Chen, Pengjun Xie, Qiang Zhang, Qiuchen Wang, Ruixue Ding, Shihang Wang, Wenxuan Huang, Yu Zeng, Zehui Chen, Zhuoning Guo.

**Figure 1.** Figure 1: Inference pipeline of the VimRAG framework. (a) The cyclic inference loop consisting of reasoning, retrieval, and memory evolution. (b) details the Evolution of Structured Reasoning Topology, where each node stores agent-specific memory, including the action, dynamically compressed multimodal observations, and its corresponding temporal and topological structure. (c) illustrates the step-by-step process o… view at source ↗

**Figure 2.** Figure 2: Quantitative analysis of memory structures. (a) Distribution of total token consumption for complete samples. (b) Count of Invalid Retrieval Action. By modeling the agent’s current state rather than just storing facts, the Graphbased paradigm effectively avoids repetitive retrieval compared to the summary-based method. Experimental Settings. We compare three agentic memory paradigms based on current … view at source ↗

**Figure 3.** Figure 3: Empirical analysis of misalignment between outcome rewards and step validity. (a) Distribution of step categories across binary outcome rewards. (b) Impact of removing redundancy or evidence steps, demonstrating the coarseness of rewards. Experimental Settings. Let a trajectory be a sequence of steps τ = {s1,s2, . . . ,sT} Zhou et al. (2025). We decompose steps into two disjoint subsets: 1) Evidence Retri… view at source ↗

**Figure 4.** Figure 4: Overview of Graph-Guided Policy Optimization. (a) Agentic Memory Training Framework segments rollout trajectories into atomic reasoning cycles within the memory paradigm, where outcome-based advantages are broadcasted to enable step-level credit assignment. (b) Credit Assignment via Graph Pruning leverages the structured graph for precise credit assignment, applying gradient masks to avoid reinforcing in… view at source ↗

**Figure 5.** Figure 5: Ablation on GGPO. Our method is more robust than baseline GSPO without pruning. Furthermore, visual memory with EnergyBased Allocation achieves higher accuracy by prioritizing high-resolution tokens for critical nodes, which proves the effectiveness of our Graph-Modulated Visual Memory Encoding in optimizing the tradeoff between detail and efficiency. Finally, consistent with the stability shown in Fig… view at source ↗

**Figure 6.** Figure 6: Analysis of Robustness and Efficiency. (a) Retrieval Hit Rate across modalities. (b) Training entropy curves, demonstrating faster convergence with Graph Pruning. (c) Breakdown of inference steps, highlighting VimRAG’s reduced redundancy. 4.3 Analysis Robust retrieval serves as the foundation for high-quality generation. High-quality generation relies heavily on the precision of the retrieved context. As i… view at source ↗

**Figure 7.** Figure 7: Overview of the data construction pipeline. The process consists of three stages: (a) Video Preprocessing, where long videos are segmented and captioned by MLLMs; (b) Query Creation, where LLMs generate complex queries and logical steps based on sampled captions; and (c) Quality Review, involving semantic filtering and difficulty ranking to ensure data quality and challenge levels. B Environment and Experi… view at source ↗

**Figure 8.** Figure 8: Case Study (Part I). The agent initializes the Multimodal Memory Graph to address a complex query regarding a calculus lecture. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: Case Study (Part II). The final answer is synthesized by traversing the critical path (vroot → v2 → (v3, v4) → v5). 23 [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Prompt of ReAct. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt of Vanilla RAG. Reward Model Prompt System Prompt: Character Introduction You are an expert evaluation system for a question answering chatbot. You are given the following information: - the query - a generated answer - a reference answer Your task is to evaluate the correctness of the generated answer. Response Format Your response should be formatted as following: <judge>True or False</judge> If … view at source ↗

**Figure 12.** Figure 12: Prompt of Model-based Reward. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt of Iterative Summarization as Memory. [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt of Graph as Memory. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗

**Figure 15.** Figure 15: Prompt of VimRAG. The model performs retrieval or generates an answer based on the [PITH_FULL_IMAGE:figures/full_fig_p031_15.png] view at source ↗

**Figure 16.** Figure 16: Prompt for Question Verifier. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_16.png] view at source ↗

read the original abstract

Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba-NLP/VRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VimRAG replaces linear histories with a dynamic DAG for multimodal RAG and allocates tokens by node position, but the claim that topology marks pivotal evidence lacks direct support.

read the letter

The paper's main contribution is a pipeline that models agent reasoning as a growing DAG of states and retrieved text-image-video items. It then uses topological position to decide token resolution for each node and applies a policy optimization that separates per-step validity from full-trajectory reward by pruning redundant branches. The code release lets others run the system directly. This setup targets a genuine bottleneck: visual tokens accumulate fast in long agent loops and linear buffers lose structure. The disentangled optimization and the explicit graph construction are practical additions that go beyond simply adding more context length. Experiments report consistent gains over prior multimodal RAG baselines on several benchmarks, which is the strongest evidence the authors provide. The soft spot sits in the central mechanism. Topological centrality in the DAG is treated as a proxy for semantic importance, yet nothing in the results shows that early central nodes are more decisive than later leaves. If retrieval order or the graph-building heuristic already correlates with importance, the token-allocation rule may simply be rediscovering that ordering rather than adding new signal. The paper would be stronger with an ablation that swaps the topology rule for random or recency-based allocation and measures the drop. Without that, the performance edge could come from the overall compression strategy rather than the graph structure itself. Readers working on agentic multimodal systems or graph-augmented memory will find the implementation details useful. The work is coherent enough on its own terms to merit peer review; the problem is real and the code is public, even if the justification for the topology rule needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces VimRAG, a framework for multimodal retrieval-augmented reasoning over text, images, and videos. It models the agent's reasoning process as a dynamic directed acyclic graph (DAG) of states and retrieved multimodal evidence, proposes a Graph-Modulated Visual Memory Encoding scheme that allocates high-resolution tokens according to each node's topological position in the DAG, and introduces Graph-Guided Policy Optimization to separate step-wise validity from trajectory-level rewards via pruning of redundant nodes. Extensive experiments are reported to show consistent state-of-the-art performance on diverse multimodal RAG benchmarks.

Significance. If the core mechanisms are shown to work as described, the work offers a concrete way to move beyond linear interaction histories in long-context multimodal agents. The graph-structured memory and topology-driven token allocation could reduce the quadratic cost of visual tokens while preserving reasoning fidelity, which would be a useful contribution to agentic RAG systems that must handle sparse but token-heavy visual evidence over many steps.

major comments (2)

[Graph-Modulated Visual Memory Encoding] The central efficiency claim rests on the assertion that topological position within the constructed DAG reliably signals semantic importance for token allocation. No ablation, correlation analysis, or counter-example study is presented to demonstrate that early central nodes are not merely artifacts of retrieval order while decisive visual clues appear in later leaves. Without such verification the claimed advantage over linear-history baselines cannot be isolated from graph-construction heuristics.
[Graph-Guided Policy Optimization] The Graph-Guided Policy Optimization is described as disentangling step-wise validity from trajectory rewards by pruning redundant nodes, yet no quantitative results (e.g., credit-assignment accuracy, policy-gradient variance, or comparison against standard PPO) are shown to confirm that the pruning step improves learning rather than simply discarding useful trajectories.

minor comments (2)

[Abstract] The abstract refers to a 'systematic study' that motivated the DAG design, but the study itself is not summarized or referenced with concrete findings.
[Experiments] Figure captions and table headers should explicitly state the number of runs, random seeds, and statistical tests used to support the SOTA claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to include the requested empirical validations.

read point-by-point responses

Referee: [Graph-Modulated Visual Memory Encoding] The central efficiency claim rests on the assertion that topological position within the constructed DAG reliably signals semantic importance for token allocation. No ablation, correlation analysis, or counter-example study is presented to demonstrate that early central nodes are not merely artifacts of retrieval order while decisive visual clues appear in later leaves. Without such verification the claimed advantage over linear-history baselines cannot be isolated from graph-construction heuristics.

Authors: We agree that isolating the contribution of topological position from retrieval-order artifacts is necessary. In the revised manuscript we will add: (1) an ablation replacing topology-based token allocation with retrieval-order-based allocation, (2) Pearson correlation between node centrality metrics and human-annotated semantic importance on a 100-example subset, and (3) qualitative counter-examples where decisive visual evidence resides in leaf nodes. These additions will demonstrate that the reported gains are not solely due to graph-construction heuristics. revision: yes
Referee: [Graph-Guided Policy Optimization] The Graph-Guided Policy Optimization is described as disentangling step-wise validity from trajectory rewards by pruning redundant nodes, yet no quantitative results (e.g., credit-assignment accuracy, policy-gradient variance, or comparison against standard PPO) are shown to confirm that the pruning step improves learning rather than simply discarding useful trajectories.

Authors: We acknowledge the absence of direct optimization diagnostics. The revision will include: (1) policy-gradient variance measurements with and without the pruning step, (2) credit-assignment accuracy evaluated on synthetic trajectories with known per-step rewards, and (3) a head-to-head comparison against standard PPO on the same multimodal RAG benchmarks. These metrics will quantify whether pruning improves learning stability beyond trajectory discarding. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experimental benchmarks

full rationale

The paper introduces VimRAG by modeling agent reasoning as a dynamic DAG and proposing Graph-Modulated Visual Memory Encoding that assigns tokens based on topological position, plus Graph-Guided Policy Optimization for credit assignment. These are design choices whose validity is asserted via extensive experiments on multimodal RAG benchmarks rather than any closed mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text; the SOTA performance claim is therefore independent of the modeling steps and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The framework rests on three newly introduced mechanisms whose correctness is not independently verified in the provided abstract.

invented entities (3)

Multimodal Memory Graph no independent evidence
purpose: Dynamic DAG that structures agent states and retrieved evidence
Core modeling choice introduced to replace linear interaction histories
Graph-Modulated Visual Memory Encoding no independent evidence
purpose: Allocate high-resolution tokens according to topological position of memory nodes
New encoding mechanism proposed to handle token-heavy visual data
Graph-Guided Policy Optimization no independent evidence
purpose: Disentangle step-wise validity from trajectory-level rewards by pruning redundant nodes
Optimization strategy introduced for fine-grained credit assignment

pith-pipeline@v0.9.0 · 5561 in / 1170 out tokens · 27782 ms · 2026-05-15T22:45:35.999680+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel matches

?

matches
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection matches

?

matches
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

Ω(mi,k)=Eint(mi,k)+γ∑vj∈Child(vi)Ω(vj) ... out-degree of node vi ... recursive feedback
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Graph-Guided Policy Optimization ... pruning memory nodes associated with redundant actions ... critical path from the root to the answer node

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

[1]

During the retrieval phase, it directly uses the original question to search for relevant text, images and videos, which are then inserted into the context to answer the question

Vanilla RAG. During the retrieval phase, it directly uses the original question to search for relevant text, images and videos, which are then inserted into the context to answer the question. Please refer to Appendix J.3 for the detailed prompt

work page
[2]

ReAct RAGYao et al. (2022). The method prompts the RAG agent using a Thought-Action- Observation loop format. Please refer to Appendix J.4 for the detailed prompt

work page 2022
[3]

VideoRAGJeong et al. (2025). This method performs frame selection to extract the informa- tion required for inference. We use GVE Guo et al. (2025) to compute similarity between frames and the query. Although this method is designed for video, embedding model allows us to apply the same coarse-to-fine granularity strategy to both text and images, serving ...

work page 2025
[4]

UniversalRAGYeo et al. (2025). It introduces RAG within cross -modal corpora by formulat- ing the task as a routing problem. We use Qwen3VL-8B (4B) as the router to align different settings, and the prompts are borrowed from the original code to ensure a fair comparison

work page 2025
[5]

MemAgentYu et al. (2025a). We implement this method by sequentially feeding the long-context search results into the model’s context. Specifically, we directly use the original question to retrieve relevant text, images, and videos, treat the retrieved results understand- ing as a long-context multimodal understanding task, and then use MemAgent to proces...

work page
[6]

Mem1Zhou et al. (2025). This approach updates its memory through a cyclical retrieval-then-memorization process. It is a context -management paradigm that is nat- urally well-suited for RAG tasks. This method is highly similar to our pilot study in Section 2.2 and follows an iterative summarization paradigm. An approximate version of this effect can be ac...

work page 2025
[7]

(2018) is a large-scale dataset focused on multi-hop question an- swering that requires reasoning across multiple documents

HotpotQAYang et al. (2018) is a large-scale dataset focused on multi-hop question an- swering that requires reasoning across multiple documents. It contains approximately 113,000 Wikipedia-based question-answer pairs. Unlike datasets constrained by pre-existing knowledge bases, it features diverse natural language questions and provides sentence-level sup...

work page 2018
[8]

(2016) is a large-scale reading comprehension dataset consisting of over 100,000 questions created by crowdworkers on a set of Wikipedia articles

SQuADRajpurkar et al. (2016) is a large-scale reading comprehension dataset consisting of over 100,000 questions created by crowdworkers on a set of Wikipedia articles. Unlike pre- vious datasets that relied on multiple-choice answers or cloze-style tasks, SQuAD requires the model to select a specific segment of text (span) from the reading passage as the...

work page 2016
[9]

(2022) is a multimodal dataset designed to mimic open-domain web search scenarios

WebQAChang et al. (2022) is a multimodal dataset designed to mimic open-domain web search scenarios. It consists of questions that require multi-hop reasoning over both text snippets and images to find the correct answer. Unlike standard VQA tasks where the image is the primary context, WebQA treats images and text as valid knowledge sources that need to ...

work page 2022
[10]

(2023) is a dataset for document visual question answering focused on understanding slides

SlideVQATanaka et al. (2023) is a dataset for document visual question answering focused on understanding slides. It contains over 2,600 slide decks with more than 52,000 slide images and 14,500 questions that require complex reasoning skills such as single-hop, multi- hop, and numerical reasoning. The dataset is designed to support various reasoning type...

work page 2023
[11]

MMLongbenchMa et al. (2024) is a dataset designed to evaluate the document under- standing capabilities of VLMs with an emphasis on long-context, multi-modal documents composed of text, images, charts, tables, and layout structures

work page 2024
[12]

(2025c) is a benchmark specifically designed to evaluate long video understanding capabilities

LVBenchWang et al. (2025c) is a benchmark specifically designed to evaluate long video understanding capabilities. Unlike datasets focused on short clips, it comprises 103 publicly sourced videos with an average duration of approximately 68 minutes, covering diverse categories such as movies, documentaries, and sports. The dataset contains 1,549 manually ...

work page
[13]

(2023); Miech et al

WikiHowQA with HowTo100MBolotova-Baranova et al. (2023); Miech et al. (2019); Jeong et al. (2025) is a composite benchmark constructed to evaluate video-based retrieval and generation tasks. It combines high-quality, human-written instructional questions and answers from the WikiHowQA dataset with the HowTo100M corpus, which consists of millions of instru...

work page 2023
[14]

(2025); Miech et al

Synthetic QA with HowTo100MJeong et al. (2025); Miech et al. (2019) is a dataset automati- cally generated to address the lack of training data containing query-video-answer triples for RAG systems. Built upon the HowTo100M corpus, it uses advanced Large Vision-Language Models to create diverse question-answer pairs grounded in specific videos. The questi...

work page 2025
[15]

optimization

XVBenchis a benchmark designed to address the lack of evaluation standards for cross- video understanding. We construct this dataset using a comprehensive pipeline in Figure 7 19 Technical Report Tongyi-RAG that performs fine-grained video segmentation, detailed captioning, and reasoning-graph construction powered by Qwen3-Max. To ensure the quality and a...

work page
[17]

Formulate clear and specific search strings using thesearchfunction

work page
[18]

Strictly Prohibited Behaviors

Compose a clear and concise final response based on the information obtained. Strictly Prohibited Behaviors

work page
[19]

Do not provide answers using information not obtained through the designated tools

work page
[20]

Do not fabricate or extrapolate beyond the content returned by the tools

work page
[21]

Do not output vague summaries or unverified speculations

work page
[22]

Do not call the search engine and give the answer in the same response. Reply Format Youmustrespond with the following format: Option 1: Searching <thinking>Your reasoning process</thinking> <search>Your search query</search> Option 2: Answering <thinking>Your reasoning process</thinking> <answer>Your detailed response</answer> User Prompt: Execution Instructions

work page
[23]

You must conduct reasoning inside<thinking>tags before answering or searching

work page
[24]

If you lack knowledge, call the search engine using<search>tags

work page
[25]

You may search as many times as needed

work page
[26]

judge" to True. Otherwise, please set

Once sufficient information is gathered, provide the final answer inside<answer>tags. Required Response Format When searching: <thinking>Your reasoning process</thinking> <search>Your search query</search> When answering: <thinking>Your reasoning process</thinking> <answer>Your detailed response</answer> User Query {Query Description} Figure 10: Prompt of...

work page
[27]

Comprehend the user’s query and identify the core points of inquiry

work page
[28]

Formulate clear and specific search strings to retrieve relevant information using the searchfunction

work page
[29]

Every time you call the search engine, you need to update the memory according to the search results and the current memory

work page
[30]

### Requirements

Compose a clear and concise final response if the information is sufficient. ### Requirements

work page
[31]

Ensure tool usage is precise and queries are well-formulated

work page
[32]

Provide accurate and well-structured answers to user queries

work page
[33]

Iterate search attempts if initial results are insufficient

work page
[34]

You can only provide a final answer or use a search engine, but not both in the same response

work page
[35]

You must call the search engine to get the search results at least once

work page
[36]

### Strictly Prohibited Behaviors:

Follow the response format. ### Strictly Prohibited Behaviors:

work page
[37]

Providing answers using information not obtained through the designated tools

work page
[38]

Fabricating or extrapolating beyond the content returned by the tools

work page
[39]

Outputting vague summaries, hypothetical judgments, or unverified speculations

work page
[40]

Repeatedly using semantically similar queries when calling the search engine

work page
[41]

Do not call the search engine and give the answer in the same response. ### Reply Format Youmustresponse with the following format: When you need to search, you need to provide the search query in the following format: <think>Your reasoning process</think> <search>Your search query</search> When you need update memory, you need to provide the summary in t...

work page
[49]

name": <function-name>,

Once you believe you have enough information to answer the question, please output an add_answer_nodefunction call. Available Tools 1.add_search_node Description:Creates a new search node in the graph. This tool should be used to issue a search query to an external engine. Each node must have a unique, summarized ID reflecting its intent. Parameters: •id ...

work page
[50]

You can only add one node per turn

work page
[51]

Each search node must: (a) Have a unique id (title) that is a short, descriptive phrase summarizing the query intent; (b) Be connected to its parent via a directed edge (specify parent_id); (c) Contain a query field with the actual search string; (d) The query must be substantially different from prior ones

work page
[52]

Then, you must summarize the relevant content from those results into a concise summary (which will be added externally to the node)

After issuing a search query, you will receive results. Then, you must summarize the relevant content from those results into a concise summary (which will be added externally to the node)

work page
[53]

You must decide at each step whether to: Answer directly (output an answer node), OR Search (output asearchnode with a new query)

work page
[54]

Queries must be substantially different from prior ones—avoid redundancy or rephrasing the same idea

work page
[55]

When generating asearchnode, use theadd_search_nodefunction

work page
[56]

When receiving search results, you can summarize the results with the summary_search_nodefunction

work page
[57]

name": <function-name>,

Once you believe you have enough information to answer the question, please output an add_answer_nodefunction call. Available Tools 1.add_search_node Description:Creates a new search node in the graph. This tool should be used to issue a search query to an external engine. Each node must have a unique, summarized ID reflecting its intent. Parameters: •id ...

work page