pith. sign in

arxiv: 2605.22478 · v2 · pith:FDTODNLPnew · submitted 2026-05-21 · 💻 cs.CV

DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval

Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot compositional image retrievalmulti-agent systemstest-time scalingself-evolutionhierarchical architectureimage retrievalperception and deliberation
0
0 comments X

The pith

A hierarchical multi-agent framework with experience self-evolution and test-time scaling achieves state-of-the-art results in zero-shot compositional image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a Perception-to-Deliberation Framework can overcome perception myopia in single spaces and logic drift in iterative systems for zero-shot compositional image retrieval. It uses a multi-agent setup where an Intent Routing Manager dispatches multi-view perceptions to form a high-recall candidate pool, followed by a Decision Manager that applies training-free reasoning policy distillation and tournament-style test-time scaling to refine results through self-evolution. A sympathetic reader would care because this offers a training-free path to more accurate matching of reference images with modification texts, useful for search and recommendation tasks. The work shows that experience-driven self-evolution and test-time scaling laws scale effectively for fine-grained multimedia retrieval. This points to a general direction for improving zero-shot systems by adding deliberation layers at test time.

Core claim

The central claim is that the one-stop hierarchical Perception-to-Deliberation Framework (PDF) is the first to bring experience self-evolution and Test-Time Scaling Law into ZS-CIR; it first deploys an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals according to modification intents and build a high-recall candidate pool, then lets the Decision Manager combine Training-free Reasoning Policy Distillation with a Tournament-style TTS strategy to produce self-evolving fine-grained reasoning that yields the final retrieval results.

What carries the argument

The hierarchical multi-agent architecture with an Intent Routing Manager that builds high-recall candidate pools from dynamic multi-view perceptions and a Decision Manager that enables self-evolving reasoning via training-free policy distillation plus tournament-style test-time scaling.

If this is right

  • PDF delivers state-of-the-art performance on the CIRR, CIRCO, and FashionIQ benchmarks for zero-shot compositional image retrieval.
  • Experience self-evolution and test-time scaling laws form a scalable route to zero-shot fine-grained multimedia retrieval without additional training.
  • The Intent Routing Manager constructs high-recall candidate pools by dispatching multi-view perception signals based on modification intents.
  • The overall framework simultaneously mitigates perception myopia in single spaces and logic drift in iterative collaboration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The deliberation approach could extend to other compositional tasks such as video retrieval or visual question answering.
  • Test-time scaling might let systems trade extra compute for higher accuracy in deployed retrieval applications.
  • Multi-agent self-evolution offers a route to reduce dependence on large labeled training sets in retrieval systems.
  • Similar tournament-style refinement could be tested on other base retrievers to measure how far it lifts their effective ceiling.

Load-bearing premise

The training-free reasoning policy distillation and tournament-style test-time scaling can produce reliable self-evolving fine-grained reasoning without being limited by the perception ceiling of the underlying retriever or introducing new logic drift.

What would settle it

If disabling the Decision Manager's self-evolution and tournament-style TTS components produces no gain or a drop in retrieval metrics on the CIRR, CIRCO, or FashionIQ datasets, or if the outputs show new inconsistent reasoning paths compared to the base setup.

Figures

Figures reproduced from arXiv: 2605.22478 by Changwei Wang, Rongtao Xu, Shengpeng Xu, Shibiao Xu, Shunpeng Chen, Xingtian Pei, Yukun Song.

Figure 1
Figure 1. Figure 1: Comparison of three mainstream ZS-CIR paradigms. (a) Unified [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed PDF. (a) IPR adaptively fuses multi-view priors to construct a high-quality candidate pool. (b) RPD consolidates the [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of intent-aware prior allocation across different query [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Case analysis on CIRCO. Single-branch semantic prediction retrieval is often constrained by perceptual bottlenecks, which may lead to rank inversion [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of candidate pool size K on retrieval performance and test￾time average output tokens per query on CIRCO. The solid lines show mAP performance under different K, while the dashed line indicates the average output tokens per query. The shaded region highlights K = 50, which achieves the best trade-off between retrieval accuracy and computational cost. in consistent performance drops. This verifies th… view at source ↗
read the original abstract

Composed Image Retrieval (CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitute the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one-stop hierarchical Perception-to-Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self-evolution and Test-Time Scaling Laws (TTS) into CIR. Relying on a hierarchical multi-agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals based on modification intents to construct a high-recall candidate pool. Subsequently, the Decision Manager combines a Training-free Reasoning Policy Distillation mechanism with a Tournament-style TTS (T-TTS) strategy to achieve self-evolving fine-grained reasoning, yielding the final retrieval results. Experimental results demonstrate that PDF achieves SOTA performance on three benchmark datasets: CIRR, CIRCO, and FashionIQ. This study indicates that experience-driven self-evolution and TTS represent a highly promising and scalable path for achieving zero-shot fine-grained multimedia retrieval. The code will be made publicly available upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a hierarchical multi-agent Perception-to-Deliberation Framework (PDF) for Zero-Shot Compositional Image Retrieval (ZS-CIR). An Intent Routing Manager dispatches multi-view Worker signals to construct a high-recall candidate pool; a Decision Manager then applies Training-free Reasoning Policy Distillation together with a Tournament-style Test-Time Scaling (TTS) strategy to produce self-evolving fine-grained reasoning and the final match. The work reports state-of-the-art results on the CIRR, CIRCO, and FashionIQ benchmarks and positions experience self-evolution plus TTS as a scalable direction for training-free fine-grained retrieval.

Significance. If the gains are shown to arise specifically from the distillation and tournament mechanisms rather than from additional inference compute or simple ensembling, the introduction of test-time self-evolution into ZS-CIR would constitute a meaningful methodological advance. The approach is training-free and therefore potentially broadly applicable; reproducible code is promised, which would further strengthen its utility to the community.

major comments (3)
  1. [§4] §4 (Experimental Results): the SOTA claim on CIRR, CIRCO, and FashionIQ is presented without accompanying tables, ablation studies, or error-type breakdowns. Quantitative evidence isolating the contribution of Training-free Reasoning Policy Distillation and Tournament-style TTS versus the initial high-recall pool is required to substantiate that the pipeline exceeds the base retriever’s perception ceiling.
  2. [§3.2] §3.2 (Decision Manager): the description of Training-free Reasoning Policy Distillation does not specify how reasoning policies are extracted from retriever signals without training or how the tournament iteration prevents amplification of perception errors or introduction of logic drift. A concrete algorithmic outline or illustrative example of one distillation–tournament cycle would directly address the central mechanistic claim.
  3. [§4.3] §4.3 (Ablation Studies): if present, the ablation on per-iteration improvement or on the effect of removing the tournament component should be expanded to include recall curves and failure-case analysis; without such data the claim that self-evolution reliably corrects rather than compounds initial retrieval errors remains under-supported.
minor comments (2)
  1. [Introduction] The terms “Perception Myopia” and “Logic Drift” are used in the introduction without explicit definitions or citations to prior literature; adding one-sentence operational definitions would improve clarity.
  2. [Figure 1] Figure 1 (overall architecture) would benefit from explicit labeling of the data flow between Intent Routing Manager and Decision Manager to make the hierarchical structure immediately legible.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript describing the Perception-to-Deliberation Framework for zero-shot compositional image retrieval. We have carefully reviewed each major comment and provide point-by-point responses below. We will incorporate revisions to address the concerns regarding experimental substantiation and mechanistic clarity.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): the SOTA claim on CIRR, CIRCO, and FashionIQ is presented without accompanying tables, ablation studies, or error-type breakdowns. Quantitative evidence isolating the contribution of Training-free Reasoning Policy Distillation and Tournament-style TTS versus the initial high-recall pool is required to substantiate that the pipeline exceeds the base retriever’s perception ceiling.

    Authors: We agree that stronger quantitative isolation of contributions is needed to support the SOTA claims. In the revised manuscript we will add full performance tables for CIRR, CIRCO, and FashionIQ together with ablation studies that directly compare the complete PDF pipeline against the base retriever, the high-recall pool alone, and simple ensembling baselines. These ablations will quantify the incremental gains attributable to Training-free Reasoning Policy Distillation and Tournament-style TTS. We will also include error-type breakdowns to show where the self-evolution mechanism improves upon specific failure modes of the initial perception stage. revision: yes

  2. Referee: [§3.2] §3.2 (Decision Manager): the description of Training-free Reasoning Policy Distillation does not specify how reasoning policies are extracted from retriever signals without training or how the tournament iteration prevents amplification of perception errors or introduction of logic drift. A concrete algorithmic outline or illustrative example of one distillation–tournament cycle would directly address the central mechanistic claim.

    Authors: We accept that the current description in §3.2 leaves the extraction and error-mitigation mechanics underspecified. We will revise this section to provide a concrete algorithmic outline of the training-free distillation process, detailing how reasoning policies are derived from retriever signals. We will also insert an illustrative walk-through of one complete distillation–tournament cycle, showing the iterative refinement steps and how the tournament structure limits propagation of perception errors while avoiding logic drift. revision: yes

  3. Referee: [§4.3] §4.3 (Ablation Studies): if present, the ablation on per-iteration improvement or on the effect of removing the tournament component should be expanded to include recall curves and failure-case analysis; without such data the claim that self-evolution reliably corrects rather than compounds initial retrieval errors remains under-supported.

    Authors: We agree that the existing ablation results would be strengthened by additional supporting data. In the revision we will expand §4.3 to include per-iteration recall curves across tournament rounds and a dedicated failure-case analysis. These additions will demonstrate that the self-evolution process consistently corrects rather than amplifies initial retrieval errors from the high-recall pool. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The provided abstract and context describe a training-free hierarchical multi-agent architecture for ZS-CIR that introduces Intent Routing Manager, Training-free Reasoning Policy Distillation, and Tournament-style TTS to address perception myopia and logic drift. No equations, fitted parameters, self-citations, or ansatzes are quoted that reduce any claimed prediction or result to its own inputs by construction. The SOTA performance is presented as an empirical outcome on external benchmarks (CIRR, CIRCO, FashionIQ) rather than a forced renaming or self-referential fit. The central claims rest on architectural design choices and test-time scaling, which remain independently falsifiable against the base retriever without reducing to self-definition or load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that multi-view worker signals can be dynamically routed without loss of information and that tournament-style deliberation improves reasoning beyond the base retriever's ceiling; no explicit free parameters or invented physical entities are described.

axioms (1)
  • domain assumption The underlying retriever has a fixed perception ceiling that can be overcome by hierarchical deliberation rather than by retraining.
    Invoked when the abstract states that existing methods fall into Logic Drift due to the perception ceiling.

pith-pipeline@v0.9.0 · 5785 in / 1250 out tokens · 35420 ms · 2026-05-22T06:36:50.194183+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.