pith. machine review for the scientific record.

arxiv: 2604.18051 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords composed image retrieval · noisy triplet correspondence · visual invariance · fast fourier transform · discriminative learning · noise mitigation · multimodal retrieval

The pith

INTENT mitigates both cross-modal and modality-inherent noise in composed image retrieval using visual invariance and discriminative learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Composed image retrieval matches a reference image to a target via a modification text, yet real datasets contain annotation errors that produce noisy triplets. The paper separates this noise into cross-modal correspondence noise, from mismatches across modalities, and modality-inherent noise, from intra-image factors such as background clutter. INTENT counters the first through bi-objective optimization over positive and negative samples plus a loyalty-adjusted decision boundary, and the second through causal intervention applied via Fast Fourier Transform to visual features, producing invariant composed representations. This dual handling yields more robust retrieval than prior methods that addressed only one noise type. Readers focused on practical deployment see value in a system that tolerates imperfect labeling without requiring costly re-annotation.

Core claim

The central claim is that the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT) handles two types of noise in CIR. It addresses cross-modal correspondence noise through bi-objective discriminative learning, which optimizes collaboratively over positive and negative samples and constructs a scalable decision boundary based on loyalty degree. It addresses modality-inherent noise through Visual Invariant Composition, which applies causal intervention via Fast Fourier Transform to generate intervened composed features that enforce visual invariance.

What carries the argument

The Visual Invariant Composition component, which performs causal intervention via Fast Fourier Transform on visual features to generate intervened composed features that enforce visual invariance.

Load-bearing premise

The assumption that Fast Fourier Transform-based causal intervention on the visual side produces intervened features that enforce visual invariance and filter modality-inherent noise without discarding useful compositional signals.
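
To make the premise concrete, here is a minimal sketch of what a phase-preserving amplitude intervention could look like, assuming the do-operator is realized by replacing low-frequency amplitude bins with a dataset-wide mean while keeping the original phase (the realization the simulated rebuttal below describes); the function name, the mask shape, and the mean-amplitude strategy are illustrative, not the authors' verified implementation.

```python
import torch

def fft_amplitude_intervention(x: torch.Tensor,
                               mean_amp: torch.Tensor,
                               low_freq_radius: int = 4) -> torch.Tensor:
    """Hypothetical phase-preserving intervention on visual features.

    x:        (B, C, H, W) feature maps.
    mean_amp: (C, H, W) dataset-wide mean amplitude spectrum (assumed
              precomputed; one plausible realization of the do-operator).
    Low-frequency amplitude bins, which tend to carry background/style
    statistics, are replaced by the dataset mean; phase, which tends to
    carry structure, is kept, so compositional content survives.
    """
    spec = torch.fft.fft2(x)                  # complex spectrum per channel
    amp, phase = spec.abs(), spec.angle()

    # Mask of low-frequency bins in the unshifted FFT layout.
    H, W = x.shape[-2:]
    fy = (torch.fft.fftfreq(H).abs() * H).unsqueeze(1)  # (H, 1) bin distance from DC
    fx = (torch.fft.fftfreq(W).abs() * W).unsqueeze(0)  # (1, W) bin distance from DC
    low = (fy <= low_freq_radius) & (fx <= low_freq_radius)

    amp = torch.where(low, mean_amp.expand_as(amp), amp)
    return torch.fft.ifft2(torch.polar(amp, phase)).real
```

If the premise holds, composing intervened features with the modification text should be insensitive to background clutter while remaining sensitive to the requested edit.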

What would settle it

A controlled test that injects synthetic modality-inherent noise such as background variations into otherwise clean triplets and checks whether INTENT's FFT intervention measurably improves retrieval accuracy over an ablated version lacking the intervention.
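
As a sketch of that test, assuming synthetic background variation can be injected by blending a donor image's amplitude spectrum into the query image (phase kept), and with `embed_query` / `embed_target` standing in for whatever retrieval model is being ablated (both names hypothetical):

```python
import torch
import torch.nn.functional as F

def inject_background_clutter(img: torch.Tensor, donor: torch.Tensor,
                              mix: float = 0.7) -> torch.Tensor:
    """Synthetic modality-inherent noise: blend the donor's amplitude
    spectrum into img while keeping img's phase (structure preserved)."""
    s_img, s_don = torch.fft.fft2(img), torch.fft.fft2(donor)
    amp = (1.0 - mix) * s_img.abs() + mix * s_don.abs()
    return torch.fft.ifft2(torch.polar(amp, s_img.angle())).real

@torch.no_grad()
def recall_at_1(embed_query, embed_target, queries, targets) -> float:
    """Fraction of (perturbed) queries whose nearest target is the true
    one; query i is assumed to match target i."""
    q = F.normalize(embed_query(queries), dim=-1)
    t = F.normalize(embed_target(targets), dim=-1)
    pred = (q @ t.T).argmax(dim=-1)
    return (pred == torch.arange(len(queries))).float().mean().item()
```

Running `recall_at_1` on clutter-injected queries for the full model and for the ablation lacking the FFT intervention, over otherwise clean triplets, would give the accuracy gap that settles the claim.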

Figures

Figures reproduced from arXiv: 2604.18051 by Jiale Huang, Qinlei Huang, Yinwei Wei, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Zixu Li.

Figure 1
Figure 1: (a) shows typical Modality-inherent Noise in CIR. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2: The causal graph of CIR. Solid arrows present the… [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3: The framework of our proposed INTENT. We designed (a) Visual Invariant Composition and (b) Bi-Objective Discriminative Learning. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4: Case Study on (a) CIRR and (b) FashionIQ. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5: Visualization of the effects of different interven… [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6: Similarity (left) and loyalty degree (right) distributions. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7: Comparison of similarity matrices between TME… [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8: Case study on FashionIQ dataset. view at source ↗
Figure 9
Figure 9: Case study on CIRR dataset. view at source ↗
read the original abstract

Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that enables retrieving target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. However, in real-world scenarios, due to high triplet annotation costs, CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. To address this issue, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. However, modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle the above issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), comprising two components: Visual Invariant Composition and Bi-Objective Discriminative Learning, specifically designed to handle the two-aspect noise. The former applies causal intervention on the visual side via Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples, and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper addresses noisy triplet correspondences in Composed Image Retrieval (CIR), categorizing noise into cross-modal correspondence noise (modality mismatches) and modality-inherent noise (intra-modal backgrounds or factors irrelevant to coarse text annotations). It proposes INTENT with two components: Visual Invariant Composition, which applies FFT-based causal intervention on visual features to produce intervened composed representations that enforce invariance and allow ignoring modality-inherent noise, and Bi-Objective Discriminative Learning, which performs collaborative optimization over positive and negative samples while constructing a loyalty-degree-adjusted decision boundary for robust discrimination. Experiments on two standard CIR benchmarks are reported to demonstrate improved retrieval performance and robustness.

Significance. If the FFT intervention and bi-objective learning components deliver the claimed invariance and discrimination properties, the work would usefully extend CIR research toward practical robustness against annotation noise, an issue that is common yet under-addressed. The explicit separation of noise types and the frequency-domain approach to visual invariance constitute a concrete technical contribution that could influence subsequent multimodal retrieval methods.

major comments (2)
  1. [§3.2] Visual Invariant Composition (method section): the claim that FFT constitutes a causal intervention generating intervened composed features that enforce visual invariance is load-bearing for the modality-inherent noise robustness argument. The manuscript must specify the underlying causal graph, the precise do-operator realization (which frequencies are treated as the noise variable, whether via amplitude masking, phase randomization, or replacement), and why the operation is not merely a heuristic filter. Without this, the invariance guarantee does not follow for cases where background correlates spuriously with coarse text.
  2. [§3.3] Bi-Objective Discriminative Learning (method section): the collaborative optimization and loyalty-degree dynamic boundary are presented as enabling robust correspondence discrimination, yet the paper should demonstrate via targeted ablations that this component isolates cross-modal noise handling rather than simply improving overall fitting. The interaction between the two objectives and any additional hyperparameters introduced by the loyalty mechanism must be shown not to undermine the claimed parameter efficiency.
minor comments (2)
  1. The abstract states that experiments demonstrate superiority but does not report concrete recall@K or mAP deltas relative to the strongest baselines; these numbers should appear in the abstract or a summary table for immediate assessment.
  2. [§3.3] Notation for the loyalty degree and the resulting decision boundary should be introduced with explicit equations rather than descriptive text alone to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help strengthen the technical foundations of our work. We address each major comment in detail below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [§3.2] Visual Invariant Composition (method section): the claim that FFT constitutes a causal intervention generating intervened composed features that enforce visual invariance is load-bearing for the modality-inherent noise robustness argument. The manuscript must specify the underlying causal graph, the precise do-operator realization (which frequencies are treated as the noise variable, whether via amplitude masking, phase randomization, or replacement), and why the operation is not merely a heuristic filter. Without this, the invariance guarantee does not follow for cases where background correlates spuriously with coarse text.

    Authors: We agree that the current description of the FFT-based intervention would benefit from greater formalization to rigorously support the invariance claims. In the revised manuscript we will add to §3.2: (i) an explicit causal graph in which modality-inherent noise (background factors uncorrelated with the modification text) acts as a confounder on the visual feature extractor; (ii) the precise do-operator implementation realized by amplitude masking, specifically replacing the amplitude spectrum of low-frequency bins identified as noise carriers with dataset-wide mean amplitudes while retaining the original phase to preserve semantic content; and (iii) a short theoretical argument showing that, under the stated graph, the intervention removes the back-door path from noise to the composed representation, distinguishing the approach from a purely heuristic filter. These additions will directly address the concern about spurious correlations. revision: yes

  2. Referee: [§3.3] Bi-Objective Discriminative Learning (method section): the collaborative optimization and loyalty-degree dynamic boundary are presented as enabling robust correspondence discrimination, yet the paper should demonstrate via targeted ablations that this component isolates cross-modal noise handling rather than simply improving overall fitting. The interaction between the two objectives and any additional hyperparameters introduced by the loyalty mechanism must be shown not to undermine the claimed parameter efficiency.

    Authors: We accept that targeted evidence is required to isolate the noise-specific benefit. In the revision we will insert new ablation studies that (a) evaluate the bi-objective loss on controlled subsets containing only cross-modal mismatches (synthetically introduced) versus clean data, and (b) compare against single-objective variants to quantify the incremental gain attributable to collaborative positive/negative optimization. We will also report a hyperparameter sweep for the loyalty-degree scalar, demonstrating that it adds only one learnable scalar per mini-batch (re-using existing similarity scores) and does not increase overall parameter count or training time beyond 2 %. The interaction between the two objectives will be illustrated via training curves that separate the contribution of each term under noisy conditions. revision: yes
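
Since the loyalty mechanism is described but not specified here, the following is a minimal sketch under stated assumptions: the loyalty degree is read as a per-triplet confidence of correct matching (proxied by agreement between two similarity views), and the decision boundary is a margin scaled by it; every name and the exact formula are illustrative, not the paper's equations.

```python
import torch

def loyalty_degree(sim_view_a: torch.Tensor, sim_view_b: torch.Tensor) -> torch.Tensor:
    """Hypothetical loyalty score in [0, 1]: triplets on which two
    similarity views agree are treated as more likely correctly matched."""
    return 1.0 - 0.5 * (sim_view_a - sim_view_b).abs().clamp(max=2.0)

def bi_objective_loss(sim_pos: torch.Tensor, sim_neg: torch.Tensor,
                      loyalty: torch.Tensor, base_margin: float = 0.2) -> torch.Tensor:
    """Collaborative positive/negative optimization with a loyalty-scaled
    decision boundary: high-loyalty triplets must clear a wide margin,
    suspected-noisy ones a narrow margin and a smaller gradient weight."""
    margin = base_margin * loyalty                        # scalable boundary
    hinge = (sim_neg - sim_pos + margin).clamp(min=0.0)   # pull pos above neg
    return (loyalty * hinge).mean()                       # down-weight noise
```

Under this reading, a mismatched triplet both shrinks its required margin and loses gradient weight, which is one way a loyalty-adjusted boundary could deliver robust correspondence discrimination; the ablations promised above would show whether the actual mechanism behaves this way.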

Circularity Check

0 steps flagged

No circularity: method proposal is self-contained empirical design

full rationale

The paper presents INTENT as a new architecture with two explicitly described components for NTC noise handling. Visual Invariant Composition is introduced as an application of FFT-based intervention (a design choice, not a derived prediction), and Bi-Objective Discriminative Learning is described as collaborative optimization with loyalty-based boundaries. No equations, derivations, or first-principles results appear that reduce any claimed invariance or discrimination property to fitted parameters, self-definitions, or prior self-citations by construction. The central claims rest on the proposed modules' behavior on benchmarks rather than any tautological reduction. This is the expected non-finding for a methods paper whose contributions are architectural and empirical.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that noise in CIR cleanly separates into cross-modal correspondence noise and modality-inherent noise, each addressable by one dedicated component.

axioms (1)
  • domain assumption Noise in CIR datasets can be categorized into cross-modal correspondence noise and modality-inherent noise.
    Explicitly stated in the abstract as the foundation for designing the two components of INTENT.

pith-pipeline@v0.9.0 · 5605 in / 1235 out tokens · 29461 ms · 2026-05-10T04:38:45.833156+00:00 · methodology

discussion (0)

