pith. sign in

arxiv: 2606.08144 · v1 · pith:4LC4WZWCnew · submitted 2026-06-06 · 💻 cs.CV

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Pith reviewed 2026-06-27 19:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords composed video retrievalschema imagerydynamic multimodal prototypesimplicit semanticsadaptive feature modulationcomposed image retrieval
0
0 comments X

The pith

Dynamic multimodal prototypes materialize implicit semantics from modification texts to guide composed video retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses composed video retrieval where modification texts often describe concepts implied through related visual cues rather than shown directly. It introduces IMAGINE to materialize these implicit semantics, called schema imagery, as shared latent concepts captured by dynamic multimodal prototypes. The prototypes then adaptively modulate visual features to inject the implicit guidance into matching. This bridges explicit video contents with retrieval intentions that are not visually explicit. The approach yields state-of-the-art results on standard CVR and CIR benchmarks.

Core claim

IMAGINE materializes implicit semantics termed schema imagery via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process and achieving state-of-the-art performance in both CVR and CIR across three benchmarks.

What carries the argument

Dynamic multimodal prototypes that materialize schema imagery to capture shared latent concepts and adaptively modulate visual features.

If this is right

  • The method extends naturally to composed image retrieval tasks that face similar implicit-modification issues.
  • Adaptive modulation of visual features by latent prototypes becomes a reusable component for other cross-modal matching pipelines.
  • Performance gains on benchmarks follow directly from better handling of concepts that appear only through semantic associations rather than direct depiction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prototype mechanism could be tested on tasks like video question answering where queries contain implicit scene assumptions.
  • If the prototypes prove stable across domains, they might reduce reliance on large explicit training sets for composed retrieval.
  • A natural next measurement would track how prototype quality changes when modification texts vary in length or ambiguity.

Load-bearing premise

Implicit semantics described in modification texts can be reliably materialized as shared latent concepts via dynamic multimodal prototypes and then used to adaptively modulate visual features without explicit visual presentation.

What would settle it

An ablation experiment on the same three benchmarks that removes the dynamic multimodal prototypes and shows retrieval performance no longer exceeds prior explicit-alignment baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.08144 by Chunxiao Wang, Jiale Huang, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Zixu Li.

Figure 1
Figure 1. Figure 1: Example of CVR task and IMAGINE’s main idea. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: IMAGINE Framework: (a) Schema Imagery Construction, (b) Imagery-guided Multimodal Composition, (c) Dual Space [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Ablation study on the internal designs of the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: Parameter sensitivity experiments of IMAGINE on [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation study on the Dual Space Alignment (DSA) [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of the attention maps from the SIC [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison on WebVid-CoVR and CIRR [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified by a modification text. While existing methods explore cross-modal correspondences, they often assume modified objects appear directly in videos. However, modification texts frequently describe concepts not explicitly presented but implicitly expressed through semantically related visual cues (e.g., "cake" implying "birthday party"). Current approaches typically rely on aligning explicit feature representations within the concrete space, neglecting critical latent associations. To address this, we propose an adaptIve scheMa-ImAGery enhanced composItional NEtwork (IMAGINE). Unlike standard explicit matching, IMAGINE materializes implicit semantics (termed schema imagery) via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process. By bridging the gap between explicit visual contents and implicit retrieval intentions, IMAGINE achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes IMAGINE, an adaptive schema-imagery enhanced compositional network for composed video retrieval (CVR). It materializes implicit semantics (termed schema imagery) from modification texts via dynamic multimodal prototypes that capture shared latent concepts, then uses these to adaptively modulate visual features. The central claim is that this bridges explicit visual contents and implicit retrieval intentions, yielding state-of-the-art performance on both CVR and composed image retrieval (CIR) across three benchmarks.

Significance. If the mechanism and SOTA claims were substantiated, the work would address a plausible gap in handling implicit concepts in compositional retrieval. However, the manuscript supplies no equations, architectural diagrams, experimental protocols, baselines, metrics, or results, so no assessment of significance is possible.

major comments (2)
  1. [Abstract] Abstract: The claim that IMAGINE 'achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks' is presented without any supporting experimental details, baselines, metrics, implementation description, or results tables. This renders the central empirical claim unsupported.
  2. [Abstract] Abstract: The core technical description ('dynamic multimodal prototypes' that 'materialize implicit semantics' and 'adaptively modulate visual features') is given at a high level with no equations, pseudocode, or architectural specification, preventing evaluation of whether the approach is well-defined or free of circularity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review. We acknowledge that the submitted abstract presents both the empirical claim and technical description at a high level without supporting details, and we will revise the manuscript to address these deficiencies.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that IMAGINE 'achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks' is presented without any supporting experimental details, baselines, metrics, implementation description, or results tables. This renders the central empirical claim unsupported.

    Authors: We agree that the abstract makes the SOTA claim without including supporting experimental information. We will revise the abstract to incorporate a concise reference to the benchmarks, metrics, and performance outcomes, and we will ensure the full manuscript contains the complete experimental protocols, baselines, metrics, and results tables. revision: yes

  2. Referee: [Abstract] Abstract: The core technical description ('dynamic multimodal prototypes' that 'materialize implicit semantics' and 'adaptively modulate visual features') is given at a high level with no equations, pseudocode, or architectural specification, preventing evaluation of whether the approach is well-defined or free of circularity.

    Authors: We agree that the abstract describes the mechanism at a high level without equations or specifications. We will revise the abstract to reference the key equations and architectural elements from the main text, and we will ensure the full manuscript supplies the mathematical formulations, pseudocode, and diagrams needed to define the approach. revision: yes

Circularity Check

0 steps flagged

No significant circularity; no derivation chain present

full rationale

The abstract and available context present only a high-level descriptive claim about materializing implicit semantics via dynamic multimodal prototypes, with no equations, parameter-fitting procedures, self-citations of uniqueness theorems, or any derivation chain that could reduce to inputs by construction. No load-bearing steps of the enumerated kinds are identifiable because no mathematical or procedural derivation is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Only the abstract is available, providing no information on free parameters, mathematical axioms, or additional invented entities beyond the named method components.

invented entities (1)
  • schema imagery no independent evidence
    purpose: to materialize implicit semantics not explicitly presented in videos
    New term introduced in the abstract for latent concepts captured by prototypes.

pith-pipeline@v0.9.1-grok · 5727 in / 1067 out tokens · 18564 ms · 2026-06-27T19:49:06.949803+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

    cs.CV 2026-06 unverdicted novelty 4.0

    RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.

Reference graph

Works this paper leans on

116 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiao-Yong Wei, Chang Wen Chen, and Qing Li. 2024. Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval. InACM MM. 7249–7258

  2. [2]

    Qianyun Yang, Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, and Liqiang Nie. 2026. STABLE: Efficient Hybrid Nearest Neighbor Search via Magnitude- Uniformity and Cardinality-Robustness.IEEE TKDE(2026)

  3. [3]

    Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, and Liqiang Nie. 2026. R3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking. arXiv preprint arXiv:2606.01113(2026)

  4. [4]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Shiqi Zhang, Qinlei Huang, Zhiheng Fu, and Yinwei Wei. 2026. HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval. InAAAI, Vol. 40. 6762–6770

  5. [5]

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, and Liqiang Nie. 2025. OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval. InACM MM. 6113–6122

  6. [6]

    Zheyuan Liu, Cristian Rodriguez Opazo, Damien Teney, and Stephen Gould

  7. [7]

    Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. InICCV. IEEE, 2105–2114

  8. [8]

    Shuxian Li, Changhao He, Xiting Liu, Joey Tianyi Zhou, Xi Peng, and Peng Hu. 2025. Learning with Noisy Triplet Correspondence for Composed Image Retrieval. InCVPR. 19628–19637

  9. [9]

    Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, and Yupeng Hu. 2026. Hint: Composed image retrieval with dual- path compositional contextualized network.arXiv preprint arXiv:2603.26341 (2026)

  10. [10]

    Shilin Lu, Zihan Zhou, Jiayou Lu, Yuanzhi Zhu, and Adams Wai-Kin Kong

  11. [11]

    Robust watermarking using generative priors against image editing: From benchmarking to advances.arXiv preprint arXiv:2410.18775(2024)

  12. [12]

    Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, and Peilin Zhao. 2026. Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training.arXiv preprint arXiv:2602.19225(2026)

  13. [13]

    Jincheng Huang, Yujie Mo, Xiaoshuang Shi, Lei Feng, and Xiaofeng Zhu. 2025. Enhancing the Influence of Labels on Unlabeled Nodes in Graph Convolutional Networks. InICML

  14. [14]

    Yiming Zeng, Wanhao Yu, Zexin Li, Tao Ren, Yu Ma, Jinghan Cao, Xiyan Chen, and Tingting Yu. 2025. Bridging the editing gap in LLMs: FineEdit for precise and targeted text modifications.EMNLP Findings(2025), 2193–2206

  15. [15]

    Yanlong Chen, Amirhossein Habibian, Luca Benini, and Yawei Li. 2026. Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs.arXiv preprint arXiv:2601.22709(2026). doi:10.48550/arXiv.2601.22709

  16. [16]

    Panqi Yang, Haodong Jing, Nanning Zheng, and Yongqiang Ma. 2026. In- strucRobo: Object-centric multi-instruction decoupling model for explainable robotic manipulation.EAAI171 (2026), 114166

  17. [17]

    Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, and Yunpu Ma. 2026. The Geometry of Reasoning: Self-Evaluation via Layerwise Trajectory Evolution. In ICML. https://openreview.net/forum?id=WQyrwQwzmK

  18. [18]

    Yuxuan Jiang, Dawei Li, and Frank Ferraro. 2025. Drp: Distilled reasoning pruning with skill-aware step decomposition for efficient large reasoning models. arXiv preprint arXiv:2505.13975(2025)

  19. [19]

    Jincheng Huang, Jialie Shen, Xiaoshuang Shi, and Xiaofeng Zhu. 2024. On Which Nodes Does GCN Fail? Enhancing GCN From the Node Perspective. In Forty-first International Conference on Machine Learning

  20. [20]

    Xinjin Li, Yu Ma, Yangchen Huang, Xingqi Wang, Yuzhen Lin, and Chenxi Zhang. 2024. Synergized data efficiency and compression (sec) optimization for large language models. InEIECS. IEEE, 586–591

  21. [21]

    Kailin Jiang, Ning Jiang, Yuntao Du, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, et al. 2025. MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models.arXiv preprint arXiv:2510.19457(2025)

  22. [22]

    Zichao Li and Zong Ke. 2025. Domain meets typology: Predicting verb-final order from universal dependencies for financial and blockchain nlp. InWorkshop on Research in Computational Linguistic Typology and Multilingual NLP. 156–164

  23. [23]

    Qianyun Yang, Peizhuo Lv, Yingjiu Li, Shengzhi Zhang, Yuxuan Chen, Zhiwei Chen, Zixu Li, and Yupeng Hu. 2026. ERASE: Bypassing Collaborative Detection of AI Counterfeit Via Comprehensive Artifacts Elimination.IEEE TDSC(March 2026), 1–18. doi:10.1109/TDSC.2026.3677794

  24. [24]

    Panqi Yang, Haodong Jing, Jiahao Chao, Tingyan Xiang, Li Lin, et al . 2026. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality.arXiv preprint arXiv:2605.05646(2026)

  25. [25]

    Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. 2024. Mace: Mass concept erasure in diffusion models. InCVPR. 6430–6440

  26. [26]

    Jinlai Zhang, Xiaolong Song, Yucheng Li, Diqing Liang, Zhiyong Zhang, and Jinhu Cai. 2026. Adaptive dual cross-attention network for multispectral object detection in autonomous driving.ESW A(2026), 132012

  27. [27]

    Guancheng Wan, Xiaoran Shang, Yuxin Wu, Guibin Zhang, Jinhe Bi, Liangtao Zheng, Xin Lin, Yue Liu, Yanbiao Ma, Wenke Huang, and Bo Du. 2025. HY- PERION: Fine-Grained Hypersphere Alignment for Robust Federated Graph Learning. InNeurIPS. https://openreview.net/forum?id=TZB6YT8Owr

  28. [28]

    Yuxuan Jiang and Francis Ferraro. 2026. Beyond math: Stories as a testbed for memorization-constrained reasoning in llms. InEACL. 5590–5607

  29. [29]

    Zijian Zhang, Rong Fu, Yangfan He, Xinze Shen, Yanlong Wang, Xiaojing Du, Haochen You, Keyan Jin, Jiazhao Shi, and Simon Fong. 2026. FinSentLLM: Multi-LLM and structured semantic signals for enhanced financial sentiment forecasting. InICASSP. IEEE, 17682–17686

  30. [30]

    Xingfeng Li, Yuangang Pan, Yuan Sun, Quansen Sun, Yinghui Sun, Ivor W Tsang, and Zhenwen Ren. 2024. Incomplete multi-view clustering with paired and balanced dynamic anchor learning.IEEE TMM27 (2024), 1486–1497

  31. [31]

    Siyuan Li, Youyuan Zhang, Fangming Liu, and Jing Li. 2026. Modality-Decoupled Online Recursive Editing.arXiv preprint arXiv:2605.20273(2026)

  32. [32]

    Mengdi Li, Jiaye Lin, Xufeng Zhao, Wenhao Lu, Peilin Zhao, Stefan Wermter, and Di Wang. 2025. Curriculum-rlaif: Curriculum alignment with reinforcement learning from ai feedback.arXiv preprint arXiv:2505.20075(2025)

  33. [33]

    Zichao Li and Zong Ke. 2025. Cross-modal augmentation for low-resource language understanding and generation. InMAGMaR. 90–99

  34. [34]

    Yang Tian, Fan Liu, Jingyuan Zhang, Yupeng Hu, Liqiang Nie, et al. 2025. CoRe- MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG. InACL. 32967–32982

  35. [35]

    Leyang Li, Shilin Lu, Yan Ren, and Adams Wai-Kin Kong. 2025. Set you straight: Auto-steering denoising trajectories to sidestep unwanted concepts.arXiv preprint arXiv:2504.12782(2025)

  36. [36]

    Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. 2024. CoVR: Learning composed video retrieval from web video captions. InAAAI, Vol. 38. 5270–5279

  37. [37]

    Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, et al. 2024. Composed video retrieval via enriched context and discriminative embeddings. InCVPR. 26896–26906

  38. [38]

    Yupeng Hu, Zixu Li, Zhiwei Chen, Qinlei Huang, Zhiheng Fu, Mingzhu Xu, and Liqiang Nie. 2026. REFINE: Composed Video Retrieval via Shared and Differential Semantics Enhancement.ACM ToMM(2026)

  39. [39]

    Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. 2024. CoVR- 2: Automatic Data Construction for Composed Video Retrieval.IEEE TPAMI (2024)

  40. [40]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu, Zhiheng Fu, and Meng Liu. 2026. ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval. InAAAI, Vol. 40. 23373– 23381

  41. [41]

    WU Yue, Zhaobo Qi, Yiling Wu, Junshu Sun, Yaowei Wang, and Shuhui Wang

  42. [42]

    Learning Fine-Grained Representations through Textual Token Disentan- glement in Composed Video Retrieval. InICLR

  43. [43]

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, and Weili Guan

  44. [44]

    InACM MM

    HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval. InACM MM. 6143–6152

  45. [45]

    Zong Ke, Yuqing Cao, Zhenrui Chen, Yuchen Yin, Shouchao He, and Yu Cheng

  46. [46]

    Finance Research Letters(2025), 107890

    Early warning of cryptocurrency reversal risks via multi-source data. Finance Research Letters(2025), 107890

  47. [47]

    Haokun Wen, Xuemeng Song, Jianhua Yin, Jianlong Wu, Weili Guan, and Liqiang Nie. 2024. Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval.IEEE TPAMI46, 5 (2024), 3665–3678

  48. [48]

    Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, and Qing Li. 2025. KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints.arXiv preprint arXiv:2510.19316(2025)

  49. [49]

    Ruanzhi Jiao, Jinlai Zhang, Chang Li, and Lin Hu. 2026. Large-kernel spatially parallel feature fusion for monocular 3D perception in autonomous driving. KBS343 (2026), 115998

  50. [50]

    Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, et al. 2025. Vismem: Latent vision memory unlocks potential of vision-language models.arXiv preprint arXiv:2511.11007(2025)

  51. [51]

    Yuxuan Jiang et al. 2026. SCRIBE: Structured Mid-Level Supervision for Tool- Using Language Models.arXiv preprint arXiv:2601.03555(2026)

  52. [52]

    Jinhe Bi, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, et al. 2026. EchoRL: Reinforcement Learning via Rollout Echoing.arXiv preprint arXiv:2605.31228(2026)

  53. [53]

    Jincheng Huang, Lun Du, Xu Chen, Qiang Fu, et al . 2023. Robust mid-pass filtering graph convolutional networks. InACM WWW. 328–338

  54. [54]

    Gengyuan Zhang, Jinhe Bi, Jindong Gu, Yanyu Chen, and Volker Tresp. 2023. SPOT! Revisiting Video-Language Models for Event Understanding.arXiv preprint arXiv:2311.12919(2023)

  55. [55]

    Jiazhao Shi, Yichen Lin, Yiheng Hua, Ziyu Wang, Zijian Zhang, Wenjia Zheng, Yun Song, Kuan Lu, and Shoufeng Lu. 2026. Multiscenario highway lane- change intention prediction: a physics-informed AI framework for three-class classification. InSTCE, Vol. 14120. SPIE, 129–145. ICMR ’26, June 16–19, 2026, Amsterdam, Netherlands Huang et al

  56. [56]

    Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, and Xunliang Cai. 2026. MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample- Efficient LLM Reasoning.arXiv preprint arXiv:2602.17550(2026)

  57. [57]

    Jinlai Zhang, Mingchao Xiang, Yongheng Hu, Wei Hao, et al. 2026. Multivariate feature learning and associative spatial information enhancement for snow object detection in autonomous driving.EAAI175 (2026), 114672

  58. [58]

    Jincheng Huang, Jie Xu, Xiaoshuang Shi, Ping Hu, Lei Feng, et al. 2026. Revisiting Confidence Calibration for Misclassification Detection in VLMs. InICLR

  59. [59]

    Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, and Xiaobin Hu. 2025. Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling.arXiv preprint arXiv:2508.03404(2025)

  60. [60]

    Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, and Liqiang Nie

  61. [61]

    FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval.https://arxiv.org/abs/2503.21309(2025)

  62. [62]

    Zhiwei Chen, Yupeng Hu, Zhiheng Fu, Zixu Li, Jiale Huang, Qinlei Huang, and Yinwei Wei. 2026. INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval. InAAAI, Vol. 40. 20463–20471

  63. [63]

    Zhiheng Fu, Yupeng Hu, Qianyun Yang, Shiqi Zhang, Zhiwei Chen, and Zixu Li

  64. [64]

    Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval.arXiv preprint arXiv:2604.19386(2026)

  65. [65]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Mingyu Zhang, Zhiheng Fu, and Liqiang Nie

  66. [66]

    ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval.arXiv preprint arXiv:2604.20358(2026)

  67. [67]

    Feifei Zhang, Mingliang Xu, Qirong Mao, and Changsheng Xu. 2020. Joint attribute manipulation and modality alignment learning for composing text and image to image retrieval. InACM MM. ACM, 3367–3376

  68. [68]

    Yuchen Yang, Min Wang, Wengang Zhou, and Houqiang Li. 2021. Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval. InACM MM. ACM, 3303–3311

  69. [69]

    Gangjian Zhang, Shikui Wei, Huaxin Pang, Shuang Qiu, and Yao Zhao. 2022. Composed Image Retrieval via Explicit Erasure and Replenishment With Se- mantic Alignment.IEEE TIP31 (2022), 5976–5988

  70. [70]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, and Liqiang Nie. 2026. COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations.IEEE TIP(2026)

  71. [71]

    Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li, and Liqiang Nie. 2026. TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval.arXiv preprint arXiv:2604.21806(2026)

  72. [72]

    Yida Zhao, Yuqing Song, and Qin Jin. 2022. Progressive learning for image retrieval with hybrid-modality queries. InSIGIR. 1012–1021

  73. [73]

    Hongguang Zhu, Yunchao Wei, Yao Zhao, Chunjie Zhang, and Shujuan Huang

  74. [74]

    AMC: Adaptive Multi-Expert Collaborative Network for Text-Guided Image Retrieval.ACM ToMM(2023)

  75. [75]

    Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, and Zhao Yang

  76. [76]

    Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation.arXiv preprint arXiv:2605.09253(2026)

  77. [77]

    Yuan Sun, Yang Qin, Yongxiang Li, Dezhong Peng, et al. 2024. Robust multi-view clustering with noisy correspondence.IEEE TKDE36, 12 (2024), 9150–9162

  78. [78]

    Zhiheng Fu, Zixu Li, Zhiwei Chen, Fangxu Liu, Yupeng Hu, Weili Guan, and Liqiang Nie. 2026. EgoAction: Egocentric Action Composition with Reliability- Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026.arXiv preprint arXiv:2605.24496(2026)

  79. [79]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Weili Guan, and Liqiang Nie. 2026. TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge.arXiv preprint arXiv:2605.24470(2026)

  80. [80]

    Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta, and Hengyuan Zhang

Showing first 80 references.