INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval
Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3
The pith
INTENT mitigates both cross-modal and modality-inherent noise in composed image retrieval using visual invariance and discriminative learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT) handles two types of noise in CIR: cross-modal correspondence noise, via bi-objective discriminative learning that optimizes collaboratively over positive and negative samples and constructs a scalable decision boundary based on a loyalty degree; and modality-inherent noise, via Visual Invariant Composition, which applies causal intervention through the Fast Fourier Transform to generate intervened composed features that enforce visual invariance.
What carries the argument
The Visual Invariant Composition component, which performs causal intervention via the Fast Fourier Transform on visual features to generate intervened composed features that enforce visual invariance.
Load-bearing premise
The assumption that Fast Fourier Transform-based causal intervention on the visual side produces intervened features that enforce visual invariance and filter modality-inherent noise without discarding useful compositional signals.
What would settle it
A controlled test that injects synthetic modality-inherent noise such as background variations into otherwise clean triplets and checks whether INTENT's FFT intervention measurably improves retrieval accuracy over an ablated version lacking the intervention.
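As an illustration (not the authors' code), a minimal sketch of such a test might look like the following; `model.compose`, the triplet loader, and the gallery features are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def inject_background_noise(image: torch.Tensor, strength: float = 0.3) -> torch.Tensor:
    """Blend smoothed random clutter into a (C, H, W) image to mimic background noise."""
    noise = torch.randn_like(image)
    noise = F.avg_pool2d(noise, kernel_size=7, stride=1, padding=3)  # low-frequency clutter
    return (1.0 - strength) * image + strength * noise

@torch.no_grad()
def recall_at_k(model, triplets, gallery: torch.Tensor, k: int = 10) -> float:
    """gallery: (N, D) L2-normalized target features; triplets yield (ref_img, text, target_idx)."""
    hits = 0
    for ref_img, mod_text, target_idx in triplets:
        query = model.compose(inject_background_noise(ref_img), mod_text)  # (D,) query feature
        sims = gallery @ query                    # cosine similarities against the gallery
        hits += int(target_idx in sims.topk(k).indices)
    return hits / len(triplets)

# The premise holds if recall_at_k(full_model, ...) consistently beats
# recall_at_k(model_without_fft_intervention, ...) as `strength` grows.
```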
Original abstract
Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that enables retrieving target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. However, in real-world scenarios, due to high triplet annotation costs, CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. To address this issue, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. However, modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle the above issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), comprising two components: Visual Invariant Composition and Bi-Objective Discriminative Learning, specifically designed to handle the two types of noise. The former applies causal intervention on the visual side via Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples, and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses noisy triplet correspondences in Composed Image Retrieval (CIR), categorizing noise into cross-modal correspondence noise (modality mismatches) and modality-inherent noise (intra-modal backgrounds or factors irrelevant to coarse text annotations). It proposes INTENT with two components: Visual Invariant Composition, which applies FFT-based causal intervention on visual features to produce intervened composed representations that enforce invariance and allow ignoring modality-inherent noise, and Bi-Objective Discriminative Learning, which performs collaborative optimization over positive and negative samples while constructing a loyalty-degree-adjusted decision boundary for robust discrimination. Experiments on two standard CIR benchmarks are reported to demonstrate improved retrieval performance and robustness.
Significance. If the FFT intervention and bi-objective learning components deliver the claimed invariance and discrimination properties, the work would usefully extend CIR research toward practical robustness against annotation noise, an issue that is common yet under-addressed. The explicit separation of noise types and the frequency-domain approach to visual invariance constitute a concrete technical contribution that could influence subsequent multimodal retrieval methods.
major comments (2)
- [§3.2] Visual Invariant Composition (method section): the claim that FFT constitutes a causal intervention generating intervened composed features that enforce visual invariance is load-bearing for the modality-inherent noise robustness argument. The manuscript must specify the underlying causal graph, the precise do-operator realization (which frequencies are treated as the noise variable, whether via amplitude masking, phase randomization, or replacement), and why the operation is not merely a heuristic filter. Without this, the invariance guarantee does not follow for cases where background correlates spuriously with coarse text. A hypothetical formalization of the requested structure is sketched after this list.
- [§3.3] Bi-Objective Discriminative Learning (method section): the collaborative optimization and loyalty-degree dynamic boundary are presented as enabling robust correspondence discrimination, yet the paper should demonstrate via targeted ablations that this component isolates cross-modal noise handling rather than simply improving overall fitting. The interaction between the two objectives and any additional hyperparameters introduced by the loyalty mechanism must be shown not to undermine the claimed parameter efficiency.
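To make the first objection concrete, here is one hypothetical way the requested causal structure could be written down. None of this notation (N, V, T, C) comes from the paper, and the actual graph may differ.

```latex
% Hypothetical causal graph: N = modality-inherent noise, V = visual features,
% T = modification text, C = composed representation. All edges are assumptions:
%   N -> V,   V -> C,   T -> C,   N -> T (spurious link via coarse annotation).
% Back-door adjustment that a frequency-domain intervention would realize:
\[
  P\big(C \mid \mathrm{do}(V = v)\big)
  \;=\; \sum_{n} P\big(C \mid V = v,\, N = n\big)\, P(N = n)
\]
% A heuristic filter merely transforms V; the do-operator additionally cuts
% the edge N -> V, which is what the invariance claim requires.
```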
minor comments (2)
- The abstract states that experiments demonstrate superiority but does not report concrete recall@K or mAP deltas relative to the strongest baselines; these numbers should appear in the abstract or a summary table for immediate assessment.
- [§3.3] Notation for the loyalty degree and the resulting decision boundary should be introduced with explicit equations rather than descriptive text alone to aid reproducibility; a hypothetical illustration of such notation follows below.
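As an illustration of the kind of notation the comment asks for, a hypothetical formalization (not the paper's actual definitions) might read:

```latex
% s_i^{+}, s_i^{-}: similarity of triplet i to its annotated target and to its
% hardest negative; alpha, tau_0, gamma: hyperparameters. All hypothetical.
\[
  \ell_i \;=\; \sigma\!\big(\alpha\,(s_i^{+} - s_i^{-})\big)
  \qquad \text{(loyalty degree)}
\]
\[
  \text{triplet } i \text{ is treated as clean} \iff
  s_i^{+} \;\geq\; \tau_0 + \gamma\,(1 - \ell_i)
  \qquad \text{(scalable decision boundary)}
\]
```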
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help strengthen the technical foundations of our work. We address each major comment in detail below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [§3.2] Visual Invariant Composition (method section): the claim that FFT constitutes a causal intervention generating intervened composed features that enforce visual invariance is load-bearing for the modality-inherent noise robustness argument. The manuscript must specify the underlying causal graph, the precise do-operator realization (which frequencies are treated as the noise variable, whether via amplitude masking, phase randomization, or replacement), and why the operation is not merely a heuristic filter. Without this, the invariance guarantee does not follow for cases where background correlates spuriously with coarse text.
Authors: We agree that the current description of the FFT-based intervention would benefit from greater formalization to rigorously support the invariance claims. In the revised manuscript we will add to §3.2: (i) an explicit causal graph in which modality-inherent noise (background factors uncorrelated with the modification text) acts as a confounder on the visual feature extractor; (ii) the precise do-operator implementation realized by amplitude masking—specifically, replacing the amplitude spectrum of low-frequency bins identified as noise carriers with dataset-wide mean amplitudes while retaining the original phase to preserve semantic content; and (iii) a short theoretical argument showing that, under the stated graph, the intervention removes the back-door path from noise to the composed representation, distinguishing the approach from a purely heuristic filter. These additions will directly address the concern about spurious correlations. revision: yes
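Read literally, the promised realization admits a compact sketch. A minimal, hedged version follows; the bin-selection radius, the shape conventions, and the source of `mean_amp` are assumptions, not the paper's recipe.

```python
import torch

def fft_intervene(feat: torch.Tensor, mean_amp: torch.Tensor, radius: int = 4) -> torch.Tensor:
    """feat: (C, H, W) feature map; mean_amp: (C, H, W) dataset-mean amplitude
    spectrum, stored in the same fftshifted layout used below."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    amp, phase = spec.abs(), spec.angle()

    # Low-frequency bins (near the shifted spectrum's center) are treated as
    # the carriers of background/style noise.
    _, h, w = feat.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    low = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2

    # do-operator realization: overwrite noisy amplitudes with the dataset-wide
    # mean while keeping the original phase, which carries semantic structure.
    amp = torch.where(low, mean_amp, amp)
    spec = torch.polar(amp, phase)
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
```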
- Referee: [§3.3] Bi-Objective Discriminative Learning (method section): the collaborative optimization and loyalty-degree dynamic boundary are presented as enabling robust correspondence discrimination, yet the paper should demonstrate via targeted ablations that this component isolates cross-modal noise handling rather than simply improving overall fitting. The interaction between the two objectives and any additional hyperparameters introduced by the loyalty mechanism must be shown not to undermine the claimed parameter efficiency.
Authors: We accept that targeted evidence is required to isolate the noise-specific benefit. In the revision we will insert new ablation studies that (a) evaluate the bi-objective loss on controlled subsets containing only cross-modal mismatches (synthetically introduced) versus clean data, and (b) compare against single-objective variants to quantify the incremental gain attributable to collaborative positive/negative optimization. We will also report a hyperparameter sweep for the loyalty-degree scalar, demonstrating that it adds only one learnable scalar per mini-batch (re-using existing similarity scores) and does not increase overall parameter count or training time beyond 2%. The interaction between the two objectives will be illustrated via training curves that separate the contribution of each term under noisy conditions. revision: yes
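To make the commitment concrete, here is a hedged sketch of what such a loyalty-weighted bi-objective loss could look like; the loyalty definition, the temperature 5.0, and the margin schedule are illustrative assumptions rather than the paper's equations.

```python
import torch
import torch.nn.functional as F

def bi_objective_loss(q: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor,
                      base_margin: float = 0.2, gamma: float = 0.3) -> torch.Tensor:
    """q, pos, neg: (B, D) L2-normalized composed-query, target, and hardest-negative features."""
    s_pos = (q * pos).sum(-1)                 # similarity to annotated target
    s_neg = (q * neg).sum(-1)                 # similarity to hardest negative

    # Loyalty degree: confidence that the triplet is correctly matched,
    # detached so the weighting cannot be gamed by the optimizer.
    loyalty = torch.sigmoid(5.0 * (s_pos - s_neg)).detach()

    # Scalable decision boundary: confident (high-loyalty) triplets face a
    # stricter boundary; suspected mismatches are held to a looser one.
    boundary = base_margin + gamma * loyalty

    pull = F.relu(boundary - s_pos)           # positive objective: attract targets
    push = F.relu(s_neg - base_margin)        # negative objective: repel negatives
    return (loyalty * (pull + push)).mean()   # noisy triplets are down-weighted
```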
Circularity Check
No circularity: method proposal is self-contained empirical design
Full rationale
The paper presents INTENT as a new architecture with two explicitly described components for NTC noise handling. Visual Invariant Composition is introduced as an application of FFT-based intervention (a design choice, not a derived prediction), and Bi-Objective Discriminative Learning is described as collaborative optimization with loyalty-based boundaries. No equations, derivations, or first-principles results appear that reduce any claimed invariance or discrimination property to fitted parameters, self-definitions, or prior self-citations by construction. The central claims rest on the proposed modules' behavior on benchmarks rather than any tautological reduction. This is the expected non-finding for a methods paper whose contributions are architectural and empirical.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Noise in CIR datasets can be categorized into cross-modal correspondence noise and modality-inherent noise.