IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Chunxiao Wang; Jiale Huang; Yupeng Hu; Zhiheng Fu; Zhiwei Chen; Zixu Li

arxiv: 2606.08144 · v1 · pith:4LC4WZWCnew · submitted 2026-06-06 · 💻 cs.CV

Jiale Huang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Chunxiao Wang, Yupeng Hu

Pith reviewed 2026-06-27 19:49 UTC · model grok-4.3

keywords composed video retrieval · schema imagery · dynamic multimodal prototypes · implicit semantics · adaptive feature modulation · composed image retrieval

The pith

Dynamic multimodal prototypes materialize implicit semantics from modification texts to guide composed video retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses composed video retrieval (CVR), where modification texts often describe concepts that are implied through related visual cues rather than shown directly. It introduces IMAGINE to materialize these implicit semantics, called schema imagery, as shared latent concepts captured by dynamic multimodal prototypes. The prototypes then adaptively modulate visual features to inject the implicit guidance into matching. This bridges explicit video contents with retrieval intentions that are not visually explicit. The paper reports state-of-the-art results on three widely used CVR and composed image retrieval (CIR) benchmarks.

Core claim

IMAGINE materializes implicit semantics termed schema imagery via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process and achieving state-of-the-art performance in both CVR and CIR across three benchmarks.

What carries the argument

Dynamic multimodal prototypes that materialize schema imagery to capture shared latent concepts and adaptively modulate visual features.
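
One way to picture this mechanism is the sketch below. It is not the paper's implementation: the abstract gives no equations, so the softmax attention over prototypes, the summed visual-plus-text query, the sigmoid gate, and the names `modulate`, `softmax`, and `dot` are all illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def modulate(visual, text, prototypes):
    """Hypothetical prototype-guided modulation of one visual feature vector.

    visual, text: lists of floats of the same dimension d.
    prototypes:   list of K prototype vectors, each of dimension d
                  (assumed learnable in a real model; fixed here).
    Returns the modulated visual feature and the prototype weights.
    """
    d = len(visual)
    # Assumption: query the prototypes with a simple sum of both modalities.
    query = [v + t for v, t in zip(visual, text)]
    weights = softmax([dot(query, p) / math.sqrt(d) for p in prototypes])
    # The "imagery" vector is the attention-weighted mixture of prototypes.
    imagery = [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(d)]
    # Assumption: a sigmoid gate rescales each channel, with a residual path.
    gate = [1.0 / (1.0 + math.exp(-x)) for x in imagery]
    modulated = [v + g * v for v, g in zip(visual, gate)]
    return modulated, weights
```

The prototype weights depend on both the visual and the text feature, which is what makes the selection "dynamic" in this toy version; a channel whose visual value is zero stays zero, so the gate can only rescale evidence that is already present.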

If this is right

The method extends naturally to composed image retrieval tasks that face similar implicit-modification issues.
Adaptive modulation of visual features by latent prototypes becomes a reusable component for other cross-modal matching pipelines.
Performance gains on benchmarks follow directly from better handling of concepts that appear only through semantic associations rather than direct depiction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prototype mechanism could be tested on tasks like video question answering where queries contain implicit scene assumptions.
If the prototypes prove stable across domains, they might reduce reliance on large explicit training sets for composed retrieval.
A natural next measurement would track how prototype quality changes when modification texts vary in length or ambiguity.

Load-bearing premise

Implicit semantics described in modification texts can be reliably materialized as shared latent concepts via dynamic multimodal prototypes and then used to adaptively modulate visual features without explicit visual presentation.

What would settle it

An ablation experiment on the same three benchmarks that removes the dynamic multimodal prototypes and shows retrieval performance no longer exceeds prior explicit-alignment baselines would falsify the central claim.
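
The comparison described above could be scored as sketched here. The abstract does not name a metric, so Recall@K is an assumption (it is common in retrieval evaluation), and the function names are hypothetical.

```python
def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the target appears in the top-k ranked ids, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(rankings, targets, k):
    """Average Recall@K over a set of queries."""
    hits = [recall_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(hits) / len(hits)


def ablation_gap(full_rankings, ablated_rankings, targets, k):
    """Recall@K of the full model minus Recall@K of the ablated model.

    A gap near zero (or negative) would suggest the removed component,
    here the prototypes, is not what drives the reported gains.
    """
    full = mean_recall_at_k(full_rankings, targets, k)
    ablated = mean_recall_at_k(ablated_rankings, targets, k)
    return full - ablated
```

To match the test stated above, the ablated model's score would also need to be compared against the published baseline numbers on the same benchmark split, not only against the full model.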

Figures

Figures reproduced from arXiv: 2606.08144 by Chunxiao Wang, Jiale Huang, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Zixu Li.

**Figure 2.** IMAGINE Framework: (a) Schema Imagery Construction, (b) Imagery-guided Multimodal Composition, (c) Dual Space …

**Figure 3.** Ablation study on the internal designs of the …

**Figure 6.** Parameter sensitivity experiments of IMAGINE on …

**Figure 5.** Ablation study on the Dual Space Alignment (DSA) …

**Figure 8.** Visualization of the attention maps from the SIC …

**Figure 7.** Qualitative comparison on WebVid-CoVR and CIRR …

Original abstract

Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified by a modification text. While existing methods explore cross-modal correspondences, they often assume modified objects appear directly in videos. However, modification texts frequently describe concepts not explicitly presented but implicitly expressed through semantically related visual cues (e.g., "cake" implying "birthday party"). Current approaches typically rely on aligning explicit feature representations within the concrete space, neglecting critical latent associations. To address this, we propose an adaptIve scheMa-ImAGery enhanced composItional NEtwork (IMAGINE). Unlike standard explicit matching, IMAGINE materializes implicit semantics (termed schema imagery) via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process. By bridging the gap between explicit visual contents and implicit retrieval intentions, IMAGINE achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.
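
The task setup in the first sentence of the abstract can be sketched generically as follows. This shows only the explicit-matching baseline the abstract contrasts against, not IMAGINE itself: composing the query by summing features, ranking by cosine similarity, and all function names are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def compose_query(reference, modification):
    """Placeholder composition: sum the two features, then normalize.

    Real systems learn this step; the sum is only a stand-in.
    """
    return l2_normalize([r + m for r, m in zip(reference, modification)])


def rank_gallery(query, gallery):
    """Return gallery ids sorted by descending cosine similarity to the query.

    gallery: dict mapping a candidate id to its feature vector.
    """
    scored = []
    for vid, feat in gallery.items():
        f = l2_normalize(feat)
        scored.append((sum(q * x for q, x in zip(query, f)), vid))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [vid for _, vid in scored]
```

In this setup a candidate can only rank highly if its explicit features resemble the composed query, which is the limitation the abstract attributes to matching in the concrete space.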

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

IMAGINE frames implicit semantics in composed video retrieval as schema imagery captured by dynamic multimodal prototypes, but the abstract supplies zero technical or empirical support for the SOTA claim.

The main point is that this paper identifies a gap in composed video retrieval where modification texts often point to unshown concepts, such as inferring a party from the word cake, and proposes to materialize those as schema imagery through dynamic multimodal prototypes that then modulate the visual features.

It does a clear job of stating why standard explicit feature alignment misses these latent associations and why that matters for real retrieval queries.

The framing around schema imagery and adaptive modulation is presented as the novel piece, though it is impossible to tell from the abstract whether this is a genuine step beyond existing multimodal prototype or attention techniques.

The obvious soft spot is the total lack of any equations, architecture diagram, training procedure, baselines, or numbers. The abstract simply asserts state-of-the-art results on three benchmarks for both CVR and CIR without showing a single metric or comparison. That makes the central claim unevaluable and leaves the key assumption—that the prototypes can reliably turn implicit text into usable latent guidance—unsupported.

No circularity or fitting issues can be checked because nothing is shown. The citation pattern is also invisible here.

This is for people already working on composed image or video retrieval. A specialist might pick up the implicit-semantics angle as a prompt for their own experiments, but the paper is not ready for broader use until the method and results are visible.

It should go to peer review so the details can be examined; the problem it names is legitimate even if the current write-up gives no evidence the fix works.

Referee Report

2 major / 0 minor

Summary. The paper proposes IMAGINE, an adaptive schema-imagery enhanced compositional network for composed video retrieval (CVR). It materializes implicit semantics (termed schema imagery) from modification texts via dynamic multimodal prototypes that capture shared latent concepts, then uses these to adaptively modulate visual features. The central claim is that this bridges explicit visual contents and implicit retrieval intentions, yielding state-of-the-art performance on both CVR and composed image retrieval (CIR) across three benchmarks.

Significance. If the mechanism and SOTA claims were substantiated, the work would address a plausible gap in handling implicit concepts in compositional retrieval. However, the manuscript supplies no equations, architectural diagrams, experimental protocols, baselines, metrics, or results, so no assessment of significance is possible.

major comments (2)

[Abstract] The claim that IMAGINE 'achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks' is presented without any supporting experimental details, baselines, metrics, implementation description, or results tables. This renders the central empirical claim unsupported.
[Abstract] The core technical description ('dynamic multimodal prototypes' that 'materialize implicit semantics' and 'adaptively modulate visual features') is given at a high level with no equations, pseudocode, or architectural specification, preventing evaluation of whether the approach is well-defined or free of circularity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review. We acknowledge that the submitted abstract presents both the empirical claim and technical description at a high level without supporting details, and we will revise the manuscript to address these deficiencies.

Referee: [Abstract] The claim that IMAGINE 'achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks' is presented without any supporting experimental details, baselines, metrics, implementation description, or results tables. This renders the central empirical claim unsupported.

Authors: We agree that the abstract makes the SOTA claim without including supporting experimental information. We will revise the abstract to incorporate a concise reference to the benchmarks, metrics, and performance outcomes, and we will ensure the full manuscript contains the complete experimental protocols, baselines, metrics, and results tables.

revision: yes

Referee: [Abstract] The core technical description ('dynamic multimodal prototypes' that 'materialize implicit semantics' and 'adaptively modulate visual features') is given at a high level with no equations, pseudocode, or architectural specification, preventing evaluation of whether the approach is well-defined or free of circularity.

Authors: We agree that the abstract describes the mechanism at a high level without equations or specifications. We will revise the abstract to reference the key equations and architectural elements from the main text, and we will ensure the full manuscript supplies the mathematical formulations, pseudocode, and diagrams needed to define the approach.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; no derivation chain present

The abstract and available context present only a high-level descriptive claim about materializing implicit semantics via dynamic multimodal prototypes, with no equations, parameter-fitting procedures, self-citations of uniqueness theorems, or any derivation chain that could reduce to inputs by construction. No load-bearing steps of the enumerated kinds are identifiable because no mathematical or procedural derivation is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only the abstract is available, providing no information on free parameters, mathematical axioms, or additional invented entities beyond the named method components.

invented entities (1)

schema imagery · no independent evidence
purpose: to materialize implicit semantics not explicitly presented in videos
note: new term introduced in the abstract for latent concepts captured by prototypes.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval
cs.CV 2026-06 unverdicted novelty 4.0

RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.

Reference graph

Works this paper leans on

116 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] [1]

Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiao-Yong Wei, Chang Wen Chen, and Qing Li. 2024. Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval. InACM MM. 7249–7258

2024

[2] [2]

Qianyun Yang, Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, and Liqiang Nie. 2026. STABLE: Efficient Hybrid Nearest Neighbor Search via Magnitude- Uniformity and Cardinality-Robustness.IEEE TKDE(2026)

2026

[3] [3]

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, and Liqiang Nie. 2026. R3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking. arXiv preprint arXiv:2606.01113(2026)

Pith/arXiv arXiv 2026

[4] [4]

Zixu Li, Yupeng Hu, Zhiwei Chen, Shiqi Zhang, Qinlei Huang, Zhiheng Fu, and Yinwei Wei. 2026. HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval. InAAAI, Vol. 40. 6762–6770

2026

[5] [5]

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, and Liqiang Nie. 2025. OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval. InACM MM. 6113–6122

2025

[6] [6]

Zheyuan Liu, Cristian Rodriguez Opazo, Damien Teney, and Stephen Gould

[7] [7]

Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. InICCV. IEEE, 2105–2114

[8] [8]

Shuxian Li, Changhao He, Xiting Liu, Joey Tianyi Zhou, Xi Peng, and Peng Hu. 2025. Learning with Noisy Triplet Correspondence for Composed Image Retrieval. InCVPR. 19628–19637

2025

[9] [9]

Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, and Yupeng Hu. 2026. Hint: Composed image retrieval with dual- path compositional contextualized network.arXiv preprint arXiv:2603.26341 (2026)

arXiv 2026

[10] [10]

Shilin Lu, Zihan Zhou, Jiayou Lu, Yuanzhi Zhu, and Adams Wai-Kin Kong

[11] [11]

Robust watermarking using generative priors against image editing: From benchmarking to advances.arXiv preprint arXiv:2410.18775(2024)

arXiv 2024

[12] [12]

Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, and Peilin Zhao. 2026. Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training.arXiv preprint arXiv:2602.19225(2026)

arXiv 2026

[13] [13]

Jincheng Huang, Yujie Mo, Xiaoshuang Shi, Lei Feng, and Xiaofeng Zhu. 2025. Enhancing the Influence of Labels on Unlabeled Nodes in Graph Convolutional Networks. InICML

2025

[14] [14]

Yiming Zeng, Wanhao Yu, Zexin Li, Tao Ren, Yu Ma, Jinghan Cao, Xiyan Chen, and Tingting Yu. 2025. Bridging the editing gap in LLMs: FineEdit for precise and targeted text modifications.EMNLP Findings(2025), 2193–2206

2025

[15] [15]

Yanlong Chen, Amirhossein Habibian, Luca Benini, and Yawei Li. 2026. Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs.arXiv preprint arXiv:2601.22709(2026). doi:10.48550/arXiv.2601.22709

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.22709 2026

[16] [16]

Panqi Yang, Haodong Jing, Nanning Zheng, and Yongqiang Ma. 2026. In- strucRobo: Object-centric multi-instruction decoupling model for explainable robotic manipulation.EAAI171 (2026), 114166

2026

[17] [17]

Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, and Yunpu Ma. 2026. The Geometry of Reasoning: Self-Evaluation via Layerwise Trajectory Evolution. In ICML. https://openreview.net/forum?id=WQyrwQwzmK

2026

[18]

Yuxuan Jiang, Dawei Li, and Frank Ferraro. 2025. Drp: Distilled reasoning pruning with skill-aware step decomposition for efficient large reasoning models. arXiv preprint arXiv:2505.13975 (2025).

[19]

Jincheng Huang, Jialie Shen, Xiaoshuang Shi, and Xiaofeng Zhu. 2024. On Which Nodes Does GCN Fail? Enhancing GCN From the Node Perspective. In Forty-first International Conference on Machine Learning.

[20]

Xinjin Li, Yu Ma, Yangchen Huang, Xingqi Wang, Yuzhen Lin, and Chenxi Zhang. 2024. Synergized data efficiency and compression (sec) optimization for large language models. In EIECS. IEEE, 586–591.

[21]

Kailin Jiang, Ning Jiang, Yuntao Du, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, et al. 2025. MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models. arXiv preprint arXiv:2510.19457 (2025).

[22]

Zichao Li and Zong Ke. 2025. Domain meets typology: Predicting verb-final order from universal dependencies for financial and blockchain nlp. In Workshop on Research in Computational Linguistic Typology and Multilingual NLP. 156–164.

[23]

Qianyun Yang, Peizhuo Lv, Yingjiu Li, Shengzhi Zhang, Yuxuan Chen, Zhiwei Chen, Zixu Li, and Yupeng Hu. 2026. ERASE: Bypassing Collaborative Detection of AI Counterfeit Via Comprehensive Artifacts Elimination. IEEE TDSC (March 2026), 1–18. doi:10.1109/TDSC.2026.3677794

[24]

Panqi Yang, Haodong Jing, Jiahao Chao, Tingyan Xiang, Li Lin, et al. 2026. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality. arXiv preprint arXiv:2605.05646 (2026).

[25]

Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. 2024. Mace: Mass concept erasure in diffusion models. In CVPR. 6430–6440.

[26]

Jinlai Zhang, Xiaolong Song, Yucheng Li, Diqing Liang, Zhiyong Zhang, and Jinhu Cai. 2026. Adaptive dual cross-attention network for multispectral object detection in autonomous driving. ESWA (2026), 132012.

[27]

Guancheng Wan, Xiaoran Shang, Yuxin Wu, Guibin Zhang, Jinhe Bi, Liangtao Zheng, Xin Lin, Yue Liu, Yanbiao Ma, Wenke Huang, and Bo Du. 2025. HYPERION: Fine-Grained Hypersphere Alignment for Robust Federated Graph Learning. In NeurIPS. https://openreview.net/forum?id=TZB6YT8Owr

[28]

Yuxuan Jiang and Francis Ferraro. 2026. Beyond math: Stories as a testbed for memorization-constrained reasoning in llms. In EACL. 5590–5607.

[29]

Zijian Zhang, Rong Fu, Yangfan He, Xinze Shen, Yanlong Wang, Xiaojing Du, Haochen You, Keyan Jin, Jiazhao Shi, and Simon Fong. 2026. FinSentLLM: Multi-LLM and structured semantic signals for enhanced financial sentiment forecasting. In ICASSP. IEEE, 17682–17686.

[30]

Xingfeng Li, Yuangang Pan, Yuan Sun, Quansen Sun, Yinghui Sun, Ivor W Tsang, and Zhenwen Ren. 2024. Incomplete multi-view clustering with paired and balanced dynamic anchor learning. IEEE TMM 27 (2024), 1486–1497.

[31]

Siyuan Li, Youyuan Zhang, Fangming Liu, and Jing Li. 2026. Modality-Decoupled Online Recursive Editing. arXiv preprint arXiv:2605.20273 (2026).

[32]

Mengdi Li, Jiaye Lin, Xufeng Zhao, Wenhao Lu, Peilin Zhao, Stefan Wermter, and Di Wang. 2025. Curriculum-rlaif: Curriculum alignment with reinforcement learning from ai feedback. arXiv preprint arXiv:2505.20075 (2025).

[33]

Zichao Li and Zong Ke. 2025. Cross-modal augmentation for low-resource language understanding and generation. In MAGMaR. 90–99.

[34]

Yang Tian, Fan Liu, Jingyuan Zhang, Yupeng Hu, Liqiang Nie, et al. 2025. CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG. In ACL. 32967–32982.

[35]

Leyang Li, Shilin Lu, Yan Ren, and Adams Wai-Kin Kong. 2025. Set you straight: Auto-steering denoising trajectories to sidestep unwanted concepts. arXiv preprint arXiv:2504.12782 (2025).

[36]

Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. 2024. CoVR: Learning composed video retrieval from web video captions. In AAAI, Vol. 38. 5270–5279.

[37]

Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, et al. 2024. Composed video retrieval via enriched context and discriminative embeddings. In CVPR. 26896–26906.

[38]

Yupeng Hu, Zixu Li, Zhiwei Chen, Qinlei Huang, Zhiheng Fu, Mingzhu Xu, and Liqiang Nie. 2026. REFINE: Composed Video Retrieval via Shared and Differential Semantics Enhancement. ACM ToMM (2026).

[39]

Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. 2024. CoVR-2: Automatic Data Construction for Composed Video Retrieval. IEEE TPAMI (2024).

[40]

Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu, Zhiheng Fu, and Meng Liu. 2026. ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval. In AAAI, Vol. 40. 23373–23381.

[41]

WU Yue, Zhaobo Qi, Yiling Wu, Junshu Sun, Yaowei Wang, and Shuhui Wang. Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval. In ICLR.

[43]

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, and Weili Guan. HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval. In ACM MM. 6143–6152.

[45]

Zong Ke, Yuqing Cao, Zhenrui Chen, Yuchen Yin, Shouchao He, and Yu Cheng. 2025. Early warning of cryptocurrency reversal risks via multi-source data. Finance Research Letters (2025), 107890.

[47]

Haokun Wen, Xuemeng Song, Jianhua Yin, Jianlong Wu, Weili Guan, and Liqiang Nie. 2024. Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval. IEEE TPAMI 46, 5 (2024), 3665–3678.

[48]

Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, and Qing Li. 2025. KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints. arXiv preprint arXiv:2510.19316 (2025).

[49]

Ruanzhi Jiao, Jinlai Zhang, Chang Li, and Lin Hu. 2026. Large-kernel spatially parallel feature fusion for monocular 3D perception in autonomous driving. KBS 343 (2026), 115998.

[50]

Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, et al. 2025. Vismem: Latent vision memory unlocks potential of vision-language models. arXiv preprint arXiv:2511.11007 (2025).

[51]

Yuxuan Jiang et al. 2026. SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models. arXiv preprint arXiv:2601.03555 (2026).

[52]

Jinhe Bi, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, et al. 2026. EchoRL: Reinforcement Learning via Rollout Echoing. arXiv preprint arXiv:2605.31228 (2026).

[53]

Jincheng Huang, Lun Du, Xu Chen, Qiang Fu, et al. 2023. Robust mid-pass filtering graph convolutional networks. In ACM WWW. 328–338.

[54]

Gengyuan Zhang, Jinhe Bi, Jindong Gu, Yanyu Chen, and Volker Tresp. 2023. SPOT! Revisiting Video-Language Models for Event Understanding. arXiv preprint arXiv:2311.12919 (2023).

[55]

Jiazhao Shi, Yichen Lin, Yiheng Hua, Ziyu Wang, Zijian Zhang, Wenjia Zheng, Yun Song, Kuan Lu, and Shoufeng Lu. 2026. Multiscenario highway lane-change intention prediction: a physics-informed AI framework for three-class classification. In STCE, Vol. 14120. SPIE, 129–145.

ICMR ’26, June 16–19, 2026, Amsterdam, Netherlands. Huang et al.

[56]

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, and Xunliang Cai. 2026. MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning. arXiv preprint arXiv:2602.17550 (2026).

[57]

Jinlai Zhang, Mingchao Xiang, Yongheng Hu, Wei Hao, et al. 2026. Multivariate feature learning and associative spatial information enhancement for snow object detection in autonomous driving. EAAI 175 (2026), 114672.

[58]

Jincheng Huang, Jie Xu, Xiaoshuang Shi, Ping Hu, Lei Feng, et al. 2026. Revisiting Confidence Calibration for Misclassification Detection in VLMs. In ICLR.

[59]

Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, and Xiaobin Hu. 2025. Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling. arXiv preprint arXiv:2508.03404 (2025).

[60]

Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, and Liqiang Nie. 2025. FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval. https://arxiv.org/abs/2503.21309 (2025).

[62]

Zhiwei Chen, Yupeng Hu, Zhiheng Fu, Zixu Li, Jiale Huang, Qinlei Huang, and Yinwei Wei. 2026. INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval. In AAAI, Vol. 40. 20463–20471.

[63]

Zhiheng Fu, Yupeng Hu, Qianyun Yang, Shiqi Zhang, Zhiwei Chen, and Zixu Li. 2026. Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval. arXiv preprint arXiv:2604.19386 (2026).

[65]

Zixu Li, Yupeng Hu, Zhiwei Chen, Mingyu Zhang, Zhiheng Fu, and Liqiang Nie. 2026. ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval. arXiv preprint arXiv:2604.20358 (2026).

[67]

Feifei Zhang, Mingliang Xu, Qirong Mao, and Changsheng Xu. 2020. Joint attribute manipulation and modality alignment learning for composing text and image to image retrieval. In ACM MM. ACM, 3367–3376.

[68]

Yuchen Yang, Min Wang, Wengang Zhou, and Houqiang Li. 2021. Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval. In ACM MM. ACM, 3303–3311.

[69]

Gangjian Zhang, Shikui Wei, Huaxin Pang, Shuang Qiu, and Yao Zhao. 2022. Composed Image Retrieval via Explicit Erasure and Replenishment With Semantic Alignment. IEEE TIP 31 (2022), 5976–5988.

[70]

Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, and Liqiang Nie. 2026. COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations. IEEE TIP (2026).

[71]

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li, and Liqiang Nie. 2026. TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval. arXiv preprint arXiv:2604.21806 (2026).

[72]

Yida Zhao, Yuqing Song, and Qin Jin. 2022. Progressive learning for image retrieval with hybrid-modality queries. In SIGIR. 1012–1021.

[73]

Hongguang Zhu, Yunchao Wei, Yao Zhao, Chunjie Zhang, and Shujuan Huang. 2023. AMC: Adaptive Multi-Expert Collaborative Network for Text-Guided Image Retrieval. ACM ToMM (2023).

[75]

Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, and Zhao Yang. 2026. Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation. arXiv preprint arXiv:2605.09253 (2026).

[77]

Yuan Sun, Yang Qin, Yongxiang Li, Dezhong Peng, et al. 2024. Robust multi-view clustering with noisy correspondence. IEEE TKDE 36, 12 (2024), 9150–9162.

[78]

Zhiheng Fu, Zixu Li, Zhiwei Chen, Fangxu Liu, Yupeng Hu, Weili Guan, and Liqiang Nie. 2026. EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026. arXiv preprint arXiv:2605.24496 (2026).

[79]

Zixu Li, Yupeng Hu, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Weili Guan, and Liqiang Nie. 2026. TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge. arXiv preprint arXiv:2605.24470 (2026).

[80]

Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta, and Hengyuan Zhang