INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval
Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3
The pith
INTENT mitigates both cross-modal and modality-inherent noise in composed image retrieval using visual invariance and discriminative learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT) handles two types of noise in CIR: cross-modal correspondence noise, via bi-objective discriminative learning that optimizes collaboratively over positive and negative samples and constructs a scalable decision boundary based on a loyalty degree; and modality-inherent noise, via Visual Invariant Composition, which applies causal intervention through the Fast Fourier Transform to generate intervened composed features that enforce visual invariance.
What carries the argument
The Visual Invariant Composition component, which performs causal intervention via the Fast Fourier Transform on visual features to generate intervened composed features that enforce visual invariance.
Load-bearing premise
The assumption that Fast Fourier Transform-based causal intervention on the visual side produces intervened features that enforce visual invariance and filter modality-inherent noise without discarding useful compositional signals.
What would settle it
A controlled test that injects synthetic modality-inherent noise such as background variations into otherwise clean triplets and checks whether INTENT's FFT intervention measurably improves retrieval accuracy over an ablated version lacking the intervention.
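As an illustration (not the authors' code), a minimal sketch of such a test might look like the following; `model.compose`, the triplet loader, and the gallery features are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def inject_background_noise(image: torch.Tensor, strength: float = 0.3) -> torch.Tensor:
    """Blend smoothed random clutter into a (C, H, W) image to mimic background noise."""
    noise = torch.randn_like(image)
    noise = F.avg_pool2d(noise, kernel_size=7, stride=1, padding=3)  # low-frequency clutter
    return (1.0 - strength) * image + strength * noise

@torch.no_grad()
def recall_at_k(model, triplets, gallery: torch.Tensor, k: int = 10) -> float:
    """gallery: (N, D) L2-normalized target features; triplets yield (ref_img, text, target_idx)."""
    hits = 0
    for ref_img, mod_text, target_idx in triplets:
        query = model.compose(inject_background_noise(ref_img), mod_text)  # (D,) query feature
        sims = gallery @ query                    # cosine similarities against the gallery
        hits += int(target_idx in sims.topk(k).indices)
    return hits / len(triplets)

# The premise holds if recall_at_k(full_model, ...) consistently beats
# recall_at_k(model_without_fft_intervention, ...) as `strength` grows.
```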
Original abstract
Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that enables retrieving target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. However, in real-world scenarios, due to high triplet annotation costs, CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. To address this issue, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. However, modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle the above issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), comprising two components: Visual Invariant Composition and Bi-Objective Discriminative Learning, specifically designed to handle the two types of noise. The former applies causal intervention on the visual side via Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples, and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses noisy triplet correspondences in Composed Image Retrieval (CIR), categorizing noise into cross-modal correspondence noise (modality mismatches) and modality-inherent noise (intra-modal backgrounds or factors irrelevant to coarse text annotations). It proposes INTENT with two components: Visual Invariant Composition, which applies FFT-based causal intervention on visual features to produce intervened composed representations that enforce invariance and allow ignoring modality-inherent noise, and Bi-Objective Discriminative Learning, which performs collaborative optimization over positive and negative samples while constructing a loyalty-degree-adjusted decision boundary for robust discrimination. Experiments on two standard CIR benchmarks are reported to demonstrate improved retrieval performance and robustness.
Significance. If the FFT intervention and bi-objective learning components deliver the claimed invariance and discrimination properties, the work would usefully extend CIR research toward practical robustness against annotation noise, an issue that is common yet under-addressed. The explicit separation of noise types and the frequency-domain approach to visual invariance constitute a concrete technical contribution that could influence subsequent multimodal retrieval methods.
major comments (2)
- [§3.2] Visual Invariant Composition (method section): the claim that FFT constitutes a causal intervention generating intervened composed features that enforce visual invariance is load-bearing for the modality-inherent noise robustness argument. The manuscript must specify the underlying causal graph, the precise do-operator realization (which frequencies are treated as the noise variable, whether via amplitude masking, phase randomization, or replacement), and why the operation is not merely a heuristic filter. Without this, the invariance guarantee does not follow for cases where background correlates spuriously with coarse text. A hypothetical formalization of the requested structure is sketched after this list.
- [§3.3] Bi-Objective Discriminative Learning (method section): the collaborative optimization and loyalty-degree dynamic boundary are presented as enabling robust correspondence discrimination, yet the paper should demonstrate via targeted ablations that this component isolates cross-modal noise handling rather than simply improving overall fitting. The interaction between the two objectives and any additional hyperparameters introduced by the loyalty mechanism must be shown not to undermine the claimed parameter efficiency.
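To make the first objection concrete, here is one hypothetical way the requested causal structure could be written down. None of this notation (N, V, T, C) comes from the paper, and the actual graph may differ.

```latex
% Hypothetical causal graph: N = modality-inherent noise, V = visual features,
% T = modification text, C = composed representation. All edges are assumptions:
%   N -> V,   V -> C,   T -> C,   N -> T (spurious link via coarse annotation).
% Back-door adjustment that a frequency-domain intervention would realize:
\[
  P\big(C \mid \mathrm{do}(V = v)\big)
  \;=\; \sum_{n} P\big(C \mid V = v,\, N = n\big)\, P(N = n)
\]
% A heuristic filter merely transforms V; the do-operator additionally cuts
% the edge N -> V, which is what the invariance claim requires.
```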
minor comments (2)
- The abstract states that experiments demonstrate superiority but does not report concrete recall@K or mAP deltas relative to the strongest baselines; these numbers should appear in the abstract or a summary table for immediate assessment.
- [§3.3] Notation for the loyalty degree and the resulting decision boundary should be introduced with explicit equations rather than descriptive text alone to aid reproducibility; a hypothetical illustration of such notation follows below.
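As an illustration of the kind of notation the comment asks for, a hypothetical formalization (not the paper's actual definitions) might read:

```latex
% s_i^{+}, s_i^{-}: similarity of triplet i to its annotated target and to its
% hardest negative; alpha, tau_0, gamma: hyperparameters. All hypothetical.
\[
  \ell_i \;=\; \sigma\!\big(\alpha\,(s_i^{+} - s_i^{-})\big)
  \qquad \text{(loyalty degree)}
\]
\[
  \text{triplet } i \text{ is treated as clean} \iff
  s_i^{+} \;\geq\; \tau_0 + \gamma\,(1 - \ell_i)
  \qquad \text{(scalable decision boundary)}
\]
```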
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help strengthen the technical foundations of our work. We address each major comment in detail below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [§3.2] Visual Invariant Composition (method section): the claim that FFT constitutes a causal intervention generating intervened composed features that enforce visual invariance is load-bearing for the modality-inherent noise robustness argument. The manuscript must specify the underlying causal graph, the precise do-operator realization (which frequencies are treated as the noise variable, whether via amplitude masking, phase randomization, or replacement), and why the operation is not merely a heuristic filter. Without this, the invariance guarantee does not follow for cases where background correlates spuriously with coarse text.
Authors: We agree that the current description of the FFT-based intervention would benefit from greater formalization to rigorously support the invariance claims. In the revised manuscript we will add to §3.2: (i) an explicit causal graph in which modality-inherent noise (background factors uncorrelated with the modification text) acts as a confounder on the visual feature extractor; (ii) the precise do-operator implementation realized by amplitude masking—specifically, replacing the amplitude spectrum of low-frequency bins identified as noise carriers with dataset-wide mean amplitudes while retaining the original phase to preserve semantic content; and (iii) a short theoretical argument showing that, under the stated graph, the intervention removes the back-door path from noise to the composed representation, distinguishing the approach from a purely heuristic filter. These additions will directly address the concern about spurious correlations. revision: yes
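Read literally, the promised realization admits a compact sketch. A minimal, hedged version follows; the bin-selection radius, the shape conventions, and the source of `mean_amp` are assumptions, not the paper's recipe.

```python
import torch

def fft_intervene(feat: torch.Tensor, mean_amp: torch.Tensor, radius: int = 4) -> torch.Tensor:
    """feat: (C, H, W) feature map; mean_amp: (C, H, W) dataset-mean amplitude
    spectrum, stored in the same fftshifted layout used below."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    amp, phase = spec.abs(), spec.angle()

    # Low-frequency bins (near the shifted spectrum's center) are treated as
    # the carriers of background/style noise.
    _, h, w = feat.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    low = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2

    # do-operator realization: overwrite noisy amplitudes with the dataset-wide
    # mean while keeping the original phase, which carries semantic structure.
    amp = torch.where(low, mean_amp, amp)
    spec = torch.polar(amp, phase)
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
```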
- Referee: [§3.3] Bi-Objective Discriminative Learning (method section): the collaborative optimization and loyalty-degree dynamic boundary are presented as enabling robust correspondence discrimination, yet the paper should demonstrate via targeted ablations that this component isolates cross-modal noise handling rather than simply improving overall fitting. The interaction between the two objectives and any additional hyperparameters introduced by the loyalty mechanism must be shown not to undermine the claimed parameter efficiency.
Authors: We accept that targeted evidence is required to isolate the noise-specific benefit. In the revision we will insert new ablation studies that (a) evaluate the bi-objective loss on controlled subsets containing only cross-modal mismatches (synthetically introduced) versus clean data, and (b) compare against single-objective variants to quantify the incremental gain attributable to collaborative positive/negative optimization. We will also report a hyperparameter sweep for the loyalty-degree scalar, demonstrating that it adds only one learnable scalar per mini-batch (re-using existing similarity scores) and does not increase overall parameter count or training time beyond 2%. The interaction between the two objectives will be illustrated via training curves that separate the contribution of each term under noisy conditions. revision: yes
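To make the commitment concrete, here is a hedged sketch of what such a loyalty-weighted bi-objective loss could look like; the loyalty definition, the temperature 5.0, and the margin schedule are illustrative assumptions rather than the paper's equations.

```python
import torch
import torch.nn.functional as F

def bi_objective_loss(q: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor,
                      base_margin: float = 0.2, gamma: float = 0.3) -> torch.Tensor:
    """q, pos, neg: (B, D) L2-normalized composed-query, target, and hardest-negative features."""
    s_pos = (q * pos).sum(-1)                 # similarity to annotated target
    s_neg = (q * neg).sum(-1)                 # similarity to hardest negative

    # Loyalty degree: confidence that the triplet is correctly matched,
    # detached so the weighting cannot be gamed by the optimizer.
    loyalty = torch.sigmoid(5.0 * (s_pos - s_neg)).detach()

    # Scalable decision boundary: confident (high-loyalty) triplets face a
    # stricter boundary; suspected mismatches are held to a looser one.
    boundary = base_margin + gamma * loyalty

    pull = F.relu(boundary - s_pos)           # positive objective: attract targets
    push = F.relu(s_neg - base_margin)        # negative objective: repel negatives
    return (loyalty * (pull + push)).mean()   # noisy triplets are down-weighted
```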
Circularity Check
No circularity: method proposal is self-contained empirical design
Full rationale
The paper presents INTENT as a new architecture with two explicitly described components for NTC noise handling. Visual Invariant Composition is introduced as an application of FFT-based intervention (a design choice, not a derived prediction), and Bi-Objective Discriminative Learning is described as collaborative optimization with loyalty-based boundaries. No equations, derivations, or first-principles results appear that reduce any claimed invariance or discrimination property to fitted parameters, self-definitions, or prior self-citations by construction. The central claims rest on the proposed modules' behavior on benchmarks rather than any tautological reduction. This is the expected non-finding for a methods paper whose contributions are architectural and empirical.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Noise in CIR datasets can be categorized into cross-modal correspondence noise and modality-inherent noise.