pith. machine review for the scientific record.

arxiv: 2604.18376 · v1 · submitted 2026-04-20 · 💻 cs.CV


Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image retrieval · person retrieval · multi-view reformulation · semantic compensation · expression drift · large language models · cross-modal alignment

The pith

LLM-generated multi-view text reformulations compensate for phrasing variations to improve text-to-image person retrieval without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that large language models can generate multiple semantically equivalent but differently worded versions of a person description, and that combining the features from these versions through averaging and residual links can reduce mismatches in the embedding space. This matters because natural language descriptions of the same person vary widely in phrasing, causing even correct image-text pairs to appear distant in current models. The framework applies the same reformulation idea to descriptions generated from images by vision-language models to close visual semantic gaps. If the compensation works, any existing retrieval system can be made more robust simply by rewriting and averaging inputs at test time rather than retraining the model.

Core claim

The authors claim a training-free semantic compensation framework: LLMs produce multi-view reformulations of each text through dual-branch prompting that extracts key visual features for guidance while encouraging diversity, and mean-pooling across the resulting embeddings, combined with residual connections, reinforces shared meaning while suppressing phrasing noise. A parallel visual compensation step generates multi-perspective image descriptions that are themselves reformulated to surface implicit visual semantics, yielding improved cross-modal alignment for person retrieval.

What carries the argument

Multi-view reformulation that uses LLM dual-branch prompting for key-feature-guided and diversity-aware text variants, followed by mean-pooling of embeddings plus residual connections to capture semantic echoes.
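The pooling-plus-residual step can be sketched in a few lines. This is a hypothetical reconstruction from the description above, not the authors' code: `alpha` stands in for the query-feature-compensation scale the paper tunes (Figure 4 studies a similar knob), and embeddings are assumed L2-normalized for cosine retrieval.

```python
import numpy as np

def compensate(query_emb: np.ndarray, view_embs: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend a query embedding with the mean of its multi-view reformulation embeddings.

    query_emb: shape (d,), embedding of the original description.
    view_embs: shape (k, d), embeddings of k LLM-reformulated variants.
    alpha: hypothetical compensation scale (a stand-in for the knob the paper tunes).
    """
    mean_view = view_embs.mean(axis=0)      # mean-pooling: shared meaning survives, phrasing noise cancels
    fused = query_emb + alpha * mean_view   # residual connection: the original signal is never discarded
    return fused / np.linalg.norm(fused)    # re-normalize for cosine-similarity retrieval
```

Because the mechanism only touches the query-side features, it can wrap any frozen retrieval backbone at inference time, which is the sense in which the method is training-free.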

If this is right

  • Existing text-to-image retrieval models gain accuracy on person search tasks when the compensation is applied at inference time with no retraining.
  • State-of-the-art results are reached on three standard text-to-image person retrieval datasets.
  • Both textual phrasing variations and visual semantic gaps are handled by the same reformulation and compensation steps.
  • The improvements require no changes to model parameters or additional training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same input-enrichment idea could be tested on related cross-modal tasks such as text-to-video retrieval where phrasing diversity also causes mismatches.
  • If the averaging reliably extracts consistent signals, simpler base models might approach the performance of more heavily trained systems simply by using richer inputs.
  • The method's effectiveness will likely scale with future improvements in how accurately language models generate meaning-preserving paraphrases.

Load-bearing premise

That the reformulations created by the language model keep exactly the same meaning as the original while differing enough in wording for averaging to reduce the embedding shifts caused by those wording differences.
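This premise is directly testable. A minimal sketch, assuming pre-computed embeddings from the same text encoder as the retrieval backbone (illustrative, not from the paper): score each reformulation against the original by cosine similarity, where values near 1 suggest meaning was preserved and low values flag drift.

```python
import numpy as np

def equivalence_scores(orig_emb: np.ndarray, view_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between the original text embedding and each reformulation.

    orig_emb:  shape (d,)
    view_embs: shape (k, d)
    Returns a length-k vector; low values flag paraphrases whose semantic drift
    would be averaged into the pooled feature as noise rather than signal.
    """
    orig_n = orig_emb / np.linalg.norm(orig_emb)
    views_n = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    return views_n @ orig_n
```

The referee's first major comment asks for exactly this kind of quantitative check (alongside entailment scores or human ratings) between original and reformulated texts.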

What would settle it

If applying the multi-view reformulation and feature averaging to any base retrieval model produces no gain or a loss in standard accuracy metrics such as rank-1 or mean average precision on the three person retrieval benchmarks, the compensation benefit would be refuted.
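Rank-1 and mAP, the metrics named here, follow standard definitions and can be computed from a query-by-gallery similarity matrix. A minimal sketch (the standard formulas, not the authors' evaluation code):

```python
import numpy as np

def rank1_and_map(sim: np.ndarray, gt: np.ndarray):
    """Rank-1 accuracy and mean Average Precision for retrieval.

    sim: (Q, G) query-to-gallery cosine similarities.
    gt:  (Q, G) boolean matrix, True where the gallery item matches the query identity.
    """
    order = np.argsort(-sim, axis=1)              # gallery indices sorted by similarity, per query
    hits = np.take_along_axis(gt, order, axis=1)  # boolean relevance in ranked order
    rank1 = hits[:, 0].mean()                     # fraction of queries whose top hit is correct
    ranks = np.arange(1, sim.shape[1] + 1)
    precision = np.cumsum(hits, axis=1) / ranks   # precision at each rank
    # average precision: mean of precision values at the relevant ranks
    ap = (precision * hits).sum(axis=1) / np.maximum(hits.sum(axis=1), 1)
    return float(rank1), float(ap.mean())
```

A before/after comparison of these two numbers on the three benchmarks, with and without the compensation wrapper around the same frozen base model, is the experiment this falsification criterion describes.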

Figures

Figures reproduced from arXiv: 2604.18376 by Chao Yuan, Guanglin Niu, Haoxuan Xu, Yujian Zhao.

Figure 1: Cosine similarity visualization between word vectors and …
Figure 2: Overview of the proposed training-free framework (MVR) for text-to-image person retrieval …
Figure 3: Heatmap of R1 accuracy and mAP on the RSTPReid …
Figure 4: (a) Impact of different temperature τ on the RSTPReid dataset with IRRA. (b) Impact of the query feature compensation scale on the RSTPReid dataset with the IRRA baseline.
Figure 5: Visual comparison of top-10 retrieved results between IRRA (baseline) and our method. Green boxes indicate correct matches, …
Figure 6: t-SNE visualizations of the baseline and our method.
read the original abstract

In text-to-image person retrieval tasks, the diversity of natural language expressions and the implicitness of visual semantics often lead to the problem of Expression Drift, where semantically equivalent texts exhibit significant feature discrepancies in the embedding space due to phrasing variations, thereby degrading the robustness of image-text alignment. This paper proposes a semantic compensation framework (MVR) driven by Large Language Models (LLMs), which enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The core methodology comprises three components: Multi-View Reformulation (MVR): A dual-branch prompting strategy combines key feature guidance (extracting visually critical components via feature similarity) and diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual variants; Textual Feature Robustness Enhancement: A training-free latent space compensation mechanism suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing "Semantic Echoes"; Visual Semantic Compensation: VLM generates multi-perspective image descriptions, which are further enhanced through shared text reformulation to address visual semantic gaps. Experiments demonstrate that our method can improve the accuracy of the original model well without training and performs SOTA on three text-to-image person retrieval datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a training-free Multi-View Reformulation (MVR) framework for text-to-image person retrieval to address 'Expression Drift' caused by phrasing variations. It uses LLMs in a dual-branch prompting strategy (key-feature guidance plus diversity-aware rewriting) to generate semantically equivalent yet diverse text variants, applies mean-pooling with residual connections to capture 'Semantic Echoes' for feature compensation, and incorporates VLM-generated multi-perspective image descriptions. The authors claim this improves base model accuracy without training and achieves SOTA results on three datasets.

Significance. If the core compensation mechanism proves reliable, the approach offers a practical, parameter-free way to boost robustness in existing cross-modal retrieval systems by leveraging LLMs/VLMs for semantic augmentation. This could be valuable for real-world person retrieval where linguistic variability is common, provided the gains are shown to stem from the proposed 'Semantic Echoes' rather than incidental effects.

major comments (2)
  1. [Abstract / Textual Feature Robustness Enhancement] The central compensation step relies on the unverified premise that LLM-generated multi-view reformulations remain semantically equivalent to the originals (while adding distributional diversity) so that mean-pooling plus residuals can capture 'Semantic Echoes'. No quantitative checks (embedding cosine similarity, entailment scores, or human equivalence ratings) are reported between original and reformulated texts, leaving open the possibility that the method injects noise rather than compensating for drift.
  2. [Abstract] The claim of SOTA performance and consistent accuracy improvements without training is asserted without any reported metrics, baselines, ablation studies, dataset details, or statistical tests. This makes it impossible to evaluate whether reported gains exceed noise or post-hoc selection effects, which is load-bearing for the central claim of robust semantic compensation.
minor comments (1)
  1. [Methodology] The term 'Semantic Echoes' is introduced without a formal definition or mathematical formulation, which could be clarified in the methodology section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions where appropriate to strengthen the paper.

read point-by-point responses
  1. Referee: The central compensation step relies on the unverified premise that LLM-generated multi-view reformulations remain semantically equivalent to the originals (while adding distributional diversity) so that mean-pooling plus residuals can capture 'Semantic Echoes'. No quantitative checks (embedding cosine similarity, entailment scores, or human equivalence ratings) are reported between original and reformulated texts, leaving open the possibility that the method injects noise rather than compensating for drift.

    Authors: We appreciate this observation on the need for explicit verification of semantic equivalence. The dual-branch prompting strategy (key-feature guidance combined with diversity-aware rewriting) is explicitly designed to preserve core semantics while introducing distributional variety, as described in Section 3.1. However, we acknowledge that the original submission did not include quantitative metrics such as embedding cosine similarities or entailment scores between original and reformulated texts. In the revised manuscript, we will add these analyses (using the same embedding model as the retrieval backbone and an off-the-shelf NLI model for entailment) to empirically support that mean-pooling captures 'Semantic Echoes' rather than noise. Human equivalence ratings on a sample subset will also be included if space permits. revision: yes

  2. Referee: The claim of SOTA performance and consistent accuracy improvements without training is asserted without any reported metrics, baselines, ablation studies, dataset details, or statistical tests. This makes it impossible to evaluate whether reported gains exceed noise or post-hoc selection effects, which is load-bearing for the central claim of robust semantic compensation.

    Authors: The full manuscript (Sections 4 and 5) reports all requested elements: quantitative metrics (Rank-1, mAP), comparisons against multiple baselines, ablation studies isolating each component (MVR, feature compensation, visual compensation), dataset details (CUHK-PEDES, ICFG-PEDES, RSTPReid), and statistical tests where relevant. The abstract follows conventional practice by summarizing these results at a high level. To directly address the concern, we will revise the abstract to include specific key numbers (e.g., average improvements and SOTA margins) while keeping it concise, ensuring the central claims are immediately verifiable. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the proposed framework

full rationale

The paper describes a training-free methodological pipeline (dual-branch LLM prompting for multi-view reformulation followed by mean-pooling and residual connections for compensation) whose central claims rest on experimental results across three datasets rather than any closed mathematical derivation. No equations, fitted parameters renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The semantic-equivalence assumption is an unverified premise that affects correctness but does not reduce any claimed result to its inputs by construction, satisfying the default expectation of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on unproven assumptions about LLM semantic fidelity and the utility of pooling; no free parameters are explicitly fitted in the abstract, but the approach implicitly depends on LLM behavior.

axioms (2)
  • domain assumption LLMs can produce semantically equivalent yet distributionally diverse textual variants from key feature guidance
    Invoked in the Multi-View Reformulation component description
  • ad hoc to paper Multi-view mean-pooling with residual connections suppresses noise and captures consistent semantic features
    Core assumption of the Textual Feature Robustness Enhancement step
invented entities (1)
  • Semantic Echoes no independent evidence
    purpose: To name the consistent semantic signal extracted via multi-view pooling
    New term introduced to describe the output of the compensation mechanism

pith-pipeline@v0.9.0 · 5516 in / 1360 out tokens · 58752 ms · 2026-05-10T04:42:45.346985+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

  2. [2]

    Rasa: relation and sensitivity aware representation learning for text-based person search

    Yang Bai, Min Cao, Daming Gao, Ziqiang Cao, Chen Chen, Zhenfeng Fan, Liqiang Nie, and Min Zhang. Rasa: relation and sensitivity aware representation learning for text-based person search. InProceedings of the Thirty-Second Interna- tional Joint Conference on Artificial Intelligence, pages 555– 563, 2023. 2, 5

  3. [3]

    Text-based person search without parallel image-text data

    Yang Bai, Jingyao Wang, Min Cao, Chen Chen, Ziqiang Cao, Liqiang Nie, and Min Zhang. Text-based person search without parallel image-text data. InProceedings of the 31st ACM International Conference on Multimedia, pages 757– 767, 2023. 2

  4. [4]

    An empirical study of clip for text-based person search

    Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, and Min Zhang. An empirical study of clip for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 465–473, 2024. 2, 5

  5. [5]

    Tipcb: A simple but effective part-based convolutional baseline for text-based person search.Neuro- computing, 494:171–181, 2022

    Yuhao Chen, Guoqing Zhang, Yujiang Lu, Zhenxing Wang, and Yuhui Zheng. Tipcb: A simple but effective part-based convolutional baseline for text-based person search.Neuro- computing, 494:171–181, 2022. 2

  6. [6]

    arXiv preprint arXiv:2107.12666 (2021) 2, 3, 11, 15

    Zefeng Ding, Changxing Ding, Zhiyin Shao, and Dacheng Tao. Semantically self-aligned network for text-to- image part-aware person re-identification.arXiv preprint arXiv:2107.12666, 2021. 1, 2, 5

  7. [7]

    Large-scale pre-training for person re-identification with noisy labels

    Dengpan Fu, Dongdong Chen, Hao Yang, Jianmin Bao, Lu Yuan, Lei Zhang, Houqiang Li, Fang Wen, and Dong Chen. Large-scale pre-training for person re-identification with noisy labels. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 2476–2486, 2022. 2

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3, 7

  9. [9]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2

  10. [10]

    Vgsg: Vision-guided semantic-group network for text- based person search.IEEE Transactions on Image Process- ing, 33:163–176, 2023

    Shuting He, Hao Luo, Wei Jiang, Xudong Jiang, and Henghui Ding. Vgsg: Vision-guided semantic-group network for text- based person search.IEEE Transactions on Image Process- ing, 33:163–176, 2023. 5

  11. [11]

    Empowering visible- infrared person re-identification with large foundation mod- els.Advances in Neural Information Processing Systems, 37: 117363–117387, 2024

    Zhangyi Hu, Bin Yang, and Mang Ye. Empowering visible- infrared person re-identification with large foundation mod- els.Advances in Neural Information Processing Systems, 37: 117363–117387, 2024. 1, 2, 3, 4, 6

  12. [12]

    Cross-modal implicit relation rea- soning and aligning for text-to-image person retrieval

    Ding Jiang and Mang Ye. Cross-modal implicit relation rea- soning and aligning for text-to-image person retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2787–2797, 2023. 2, 5, 6 8

  13. [13]

    Modeling thousands of hu- man annotators for generalizable text-to-image person re- identification.arXiv preprint arXiv:2503.09962, 2025

    Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, and Xiangmin Xu. Modeling thousands of hu- man annotators for generalizable text-to-image person re- identification.arXiv preprint arXiv:2503.09962, 2025. 1, 2, 5, 6

  14. [14]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1, 2

  15. [15]

    Learning semantic polymorphic mapping for text-based per- son retrieval.IEEE Transactions on Multimedia, 2024

    Jiayi Li, Min Jiang, Jun Kong, Xuefeng Tao, and Xi Luo. Learning semantic polymorphic mapping for text-based per- son retrieval.IEEE Transactions on Multimedia, 2024. 5

  16. [16]

    Person search with natural lan- guage description

    Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. Person search with natural lan- guage description. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1970–1979,

  17. [17]

    Dcel: deep cross-modal evi- dential learning for text-based person retrieval

    Shenshen Li, Xing Xu, Yang Yang, Fumin Shen, Yijun Mo, Yujie Li, and Heng Tao Shen. Dcel: deep cross-modal evi- dential learning for text-based person retrieval. InProceed- ings of the 31st ACM International Conference on Multime- dia, pages 6292–6300, 2023. 2, 5

  18. [18]

    Adaptive uncertainty-based learning for text-based person retrieval

    Shenshen Li, Chen He, Xing Xu, Fumin Shen, Yang Yang, and Heng Tao Shen. Adaptive uncertainty-based learning for text-based person retrieval. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 3172–3180, 2024. 5

  19. [19]

    Text and image are mutually beneficial: Enhancing training-free few-shot classification with clip

    Yayuan Li, Jintao Guo, Lei Qi, Wenbin Li, and Yinghuan Shi. Text and image are mutually beneficial: Enhancing training-free few-shot classification with clip. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 5039–5047, 2025. 2

  20. [20]

    Cross-modal adaptive dual association for text-to- image person retrieval.IEEE Transactions on Multimedia, 26:6609–6620, 2024

    Dixuan Lin, Yi-Xing Peng, Jingke Meng, and Wei-Shi Zheng. Cross-modal adaptive dual association for text-to- image person retrieval.IEEE Transactions on Multimedia, 26:6609–6620, 2024. 2, 5

  21. [21]

    Feature pyramid net- works for object detection

    Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid net- works for object detection. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 2117–2125, 2017. 2

  22. [22]

    Try harder: Hard sample generation and learning for cloth-changing per- son re-id

    Hankun Liu, Yujian Zhao, and Guanglin Niu. Try harder: Hard sample generation and learning for cloth-changing per- son re-id. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 1704–1713, 2025. 2

  23. [23]

    Causality-inspired invariant representation learning for text-based person retrieval

    Yu Liu, Guihe Qin, Haipeng Chen, Zhiyong Cheng, and Xun Yang. Causality-inspired invariant representation learning for text-based person retrieval. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 14052–14060, 2024. 5

  24. [24]

    Looking alike from far to near: Enhancing cross- resolution re-identification via feature vector panning.arXiv preprint arXiv:2510.00936, 2025

    Zanwu Liu, Chao Yuan, Bo Li, Xiaowei Zhang, and Guan- glin Niu. Looking alike from far to near: Enhancing cross- resolution re-identification via feature vector panning.arXiv preprint arXiv:2510.00936, 2025. 2

  25. [25]

    Omniperson: Unified identity-preserving pedestrian generation.arXiv preprint arXiv:2512.02554,

    Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang, Longkun Zhou, Yujia Zhang, Shangze Li, and Yifan Xu. Omniperson: Unified identity-preserving pedestrian generation.arXiv preprint arXiv:2512.02554,

  26. [26]

    Gpt-4o: Openai’s multimodal language model,

    OpenAI. Gpt-4o: Openai’s multimodal language model,

  27. [27]

    Plot: Text-based person search with part slot attention for corresponding part discovery

    Jicheol Park, Dongwon Kim, Boseung Jeong, and Suha Kwak. Plot: Text-based person search with part slot attention for corresponding part discovery. InEuropean Conference on Computer Vision, pages 474–490. Springer, 2024. 5

  28. [28]

    Noisy-correspondence learning for text-to-image person re-identification

    Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, and Peng Hu. Noisy-correspondence learning for text-to-image person re-identification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27197–27206, 2024. 2, 5, 6

  29. [29]

    Yuxuan Qiu, Liyang Wang, Wei Song, Jiawei Liu, Zhip- ing Shi, and Na Jiang. Advancing visible-infrared person re-identification: Synergizing visual-textual reasoning and cross-modal feature alignment.IEEE Transactions on Infor- mation Forensics and Security, 20:2184–2196, 2025. 2

  30. [30]

    Language models are unsuper- vised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsuper- vised multitask learners.OpenAI blog, 1(8):9, 2019. 2

  31. [31]

    Learn- ing transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 4

  32. [32]

    High-resolution image syn- thesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 10684–10695, 2022. 2

  33. [33]

    Adversarial representation learning for text-to-image match- ing

    Nikolaos Sarafianos, Xiang Xu, and Ioannis A Kakadiaris. Adversarial representation learning for text-to-image match- ing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 5814–5824, 2019. 2

  34. [34]

    Learning granularity-unified representations for text-to-image person re-identification

    Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. In Proceedings of the 30th acm international conference on mul- timedia, pages 5566–5574, 2022. 5

  35. [35]

    Unified pre-training with pseudo texts for text-to-image person re-identification

    Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, and Jingdong Wang. Unified pre-training with pseudo texts for text-to-image person re-identification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11174–11184, 2023. 2, 5

  36. [36]

    See finer, see more: Implicit modality alignment for text-based person retrieval

    Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. InEuropean Conference on Computer Vision, pages 624–

  37. [37]

    From data deluge to data curation: A filtering-wora paradigm for efficient text-based person search

    Jintao Sun, Hao Fei, Gangyi Ding, and Zhedong Zheng. From data deluge to data curation: A filtering-wora paradigm for efficient text-based person search. InTHE WEB CON- FERENCE 2025. 5

  38. [38]

    Harnessing the power of mllms for transferable text-to-image person reid

    Wentan Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yib- ing Zhan, and Dapeng Tao. Harnessing the power of mllms for transferable text-to-image person reid. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17127–17137, 2024. 2

  39. [39]

    Fine-grained semantics- aware representation learning for text-based person retrieval

    Di Wang, Feng Yan, Yifeng Wang, Lin Zhao, Xiao Liang, Haodi Zhong, and Ronghua Zhang. Fine-grained semantics- aware representation learning for text-based person retrieval. InProceedings of the 2024 International Conference on Mul- timedia Retrieval, pages 92–100, 2024. 5 9

  40. [40]

    High-order information matters: Learning relation and topology for occluded person re-identification

    Guan’an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. High-order information matters: Learning relation and topology for occluded person re-identification. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6449–6458, 2020. 2

  41. [41]

    When large vision-language models meet person re-identification,

    Qizao Wang, Bin Li, and Xiangyang Xue. When large vision-language models meet person re-identification.arXiv preprint arXiv:2411.18111, 2024. 2

  42. [42]

    Vi- taa: Visual-textual attributes alignment in person search by natural language

    Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. Vi- taa: Visual-textual attributes alignment in person search by natural language. InComputer vision–ECCV 2020: 16th Eu- ropean conference, glasgow, UK, August 23–28, 2020, pro- ceedings, part XII 16, pages 402–420. Springer, 2020. 2

  43. [43]

    Caibc: Capturing all-round infor- mation beyond color for text-based person retrieval

    Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Caibc: Capturing all-round infor- mation beyond color for text-based person retrieval. InPro- ceedings of the 30th ACM international conference on multi- media, pages 5314–5322, 2022. 5

  44. [44]

    Grok 3 beta — the age of reasoning agents.https: //x.ai/blog/grok-3, 2025

    xAI. Grok 3 beta — the age of reasoning agents.https: //x.ai/blog/grok-3, 2025. 3, 7

  45. [45]

    Lv-reid: Large language-vision alignment model for text-based per- son re-identification

    Yinghui Xia, Chao Wang, and Jinsong Yang. Lv-reid: Large language-vision alignment model for text-based per- son re-identification. InICASSP 2025-2025 IEEE Interna- tional Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pages 1–5. IEEE, 2025. 2

  46. [46]

    arXiv preprint arXiv:2603.14243 (2026)

    Haoxuan Xu and Guanglin Niu. Bit: Matching- based bi-directional interaction transformation network for visible-infrared person re-identification.arXiv preprint arXiv:2603.14243, 2026. 2

  47. [47]

    Identity-aware feature decoupling learning for clothing-change person re- identification

    Haoxuan Xu, Bo Li, and Guanglin Niu. Identity-aware feature decoupling learning for clothing-change person re- identification. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  48. [48]

    CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification

    Haoxuan Xu, Hanzi Wang, and Guanglin Niu. Cmcc-reid: Cross-modality clothing-change person re-identification. arXiv preprint arXiv:2604.02808, 2026. 2

  49. [49]

    Learning comprehensive representations with richer self for text-to-image person re-identification

    Shuanglin Yan, Neng Dong, Jun Liu, Liyan Zhang, and Jinhui Tang. Learning comprehensive representations with richer self for text-to-image person re-identification. InProceed- ings of the 31st ACM international conference on multimedia, pages 6202–6211, 2023. 5

  50. [50]

    Clip-driven fine-grained text-image person re-identification

    Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. Clip-driven fine-grained text-image person re-identification. IEEE Transactions on Image Processing, 32:6032–6046,

  51. [51]

    Prototypical prompting for text-to-image person re- identification

    Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, and Jinhui Tang. Prototypical prompting for text-to-image person re- identification. InProceedings of the 32nd ACM International Conference on Multimedia, pages 2331–2340, 2024. 5

  52. [52]

    Mllmreid: multimodal large language model-based person re-identification.arXiv preprint arXiv:2401.13201, 2024

    Shan Yang and Yongfei Zhang. Mllmreid: multimodal large language model-based person re-identification.arXiv preprint arXiv:2401.13201, 2024. 2

  53. [53]

    Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark

    Shuyu Yang, Yinan Zhou, Zhedong Zheng, Yaxiong Wang, Li Zhu, and Yujiao Wu. Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark. InProceedings of the 31st ACM International Conference on Multimedia, pages 4492–4501, 2023. 2, 5

  54. [54]

    Chao Yuan, Zanwu Liu, Guiwei Zhang, Haoxuan Xu, Yujian Zhao, Guanglin Niu, and Bo Li. Modality-transition representation learning for visible-infrared person re-identification. arXiv preprint arXiv:2511.02685, 2025. 2

  55. [55]

    Chao Yuan, Guiwei Zhang, Changxiao Ma, Tianyi Zhang, and Guanglin Niu. From poses to identity: Training-free person re-identification via feature centralization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24409–24418, 2025. 3

  56. [56]

    Chao Yuan, Tianyi Zhang, and Guanglin Niu. Neighbor-based feature and index enhancement for person re-identification. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5762–5769, 2025. 3

  57. [57]

    Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 686–701.

  58. [58]

    Yujian Zhao, Hankun Liu, and Guanglin Niu. Mos: Mitigating optical-SAR modality gap for cross-modal ship re-identification. arXiv preprint arXiv:2512.03404, 2025. 2

  59. [59]

    Yujian Zhao, Chengru Wu, Yinong Xu, Xuanzheng Du, Ruiyu Li, and Guanglin Niu. Ccup: A controllable synthetic data generation pipeline for pretraining cloth-changing person re-identification models. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025. 2

  60. [60]

    Zhiwei Zhao, Bin Liu, Yan Lu, Qi Chu, and Nenghai Yu. Unifying multi-modal uncertainty modeling and semantic alignment for text-to-image person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7534–7542, 2024. 5

  61. [61]

    Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–23, 2020. 2

  62. [62]

    Aichun Zhu, Zijie Wang, Yifeng Li, Xili Wan, Jing Jin, Tian Wang, Fangqiang Hu, and Gang Hua. Dssl: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pages 209–217, 2021. 1, 2, 5

  63. [63]

    Jialong Zuo, Jiahao Hong, Feng Zhang, Changqian Yu, Hanyu Zhou, Changxin Gao, Nong Sang, and Jingdong Wang. Plip: Language-image pre-training for person representation learning. Advances in Neural Information Processing Systems, 37:45666–45702, 2024. 1, 2

  64. [64]

    Jialong Zuo, Hanyu Zhou, Ying Nie, Feng Zhang, Tianyu Guo, Nong Sang, Yunhe Wang, and Changxin Gao. Ufinebench: Towards text-based person retrieval with ultra-fine granularity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22010–22019, 2024. 5

Computation cost

For each generation, taking DeepSeek-V3 as an example, inference requires approximately 0.246 TFLOPs per token, making it highly efficient for multi-round generation. In our setting, producing 15 reformulations incurs a total cost equivalent to only a few billion floating-point operations, which can be completed within 0.5 seconds either on a local ...
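The cost estimate above can be reproduced with a short back-of-envelope calculation. This is a sketch using the paper's stated per-token figure for DeepSeek-V3; the token count per reformulated caption is an illustrative assumption, not a number from the paper.

```python
# Back-of-envelope generation-cost estimate for multi-round reformulation.
# TFLOPS_PER_TOKEN comes from the paper (DeepSeek-V3); TOKENS_PER_CAPTION
# is an assumed average caption length for illustration only.

TFLOPS_PER_TOKEN = 0.246     # stated per-token cost (DeepSeek-V3)
NUM_REFORMULATIONS = 15      # reformulations generated per query
TOKENS_PER_CAPTION = 40      # assumption: average tokens per caption

def generation_cost_tflops(n_reforms: int, tokens_each: int) -> float:
    """Total generation cost in TFLOPs for one query's reformulations."""
    return TFLOPS_PER_TOKEN * n_reforms * tokens_each

cost = generation_cost_tflops(NUM_REFORMULATIONS, TOKENS_PER_CAPTION)
print(f"{cost:.1f} TFLOPs")
```

Varying `TOKENS_PER_CAPTION` shows how the cost scales linearly in both the number of reformulations and their length.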

Prompts

7.1. Our prompt with P_key

System:
Instructions: Suppose you now have a picture of a pedestrian. I will give you a caption and its key words list; your task is to rewrite the caption.
- Every key word must be used, but the order may change and other words may be replaced with similar meanings.
- Give me 15 different captions and return...
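The P_key prompt above can be assembled programmatically before being sent to the LLM. The sketch below is illustrative: the function name and exact wording are assumptions paraphrasing the prompt shown, not the authors' code.

```python
# Hypothetical builder for the P_key reformulation prompt: given a caption
# and its extracted key words, produce the instruction text sent to the LLM.

def build_pkey_prompt(caption: str, keywords: list[str], n: int = 15) -> str:
    """Assemble a P_key-style prompt (wording paraphrased from the paper)."""
    kw = ", ".join(f"'{w}'" for w in keywords)
    return (
        "Suppose you now have a picture of a pedestrian. I will give you "
        "a caption and its key words list; your task is to rewrite the caption.\n"
        "- Every key word must be used, but the order may change and other "
        "words may be replaced with similar meanings.\n"
        f"- Give me {n} different captions.\n"
        f"Caption: {caption}\n"
        f"Keywords: {kw}"
    )

prompt = build_pkey_prompt(
    "This man has short black hair and wears a suit jacket.",
    ["man", "short", "black", "hair", "suit", "jacket"],
)
print(prompt)
```

Keeping the key-word list explicit in the prompt is what constrains the LLM to vary phrasing while preserving the attributes the retrieval model must match.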

Examples

8.1. Query compensation with P_key

Example Input:
Caption: This man has short black hair and wears a suit jacket, black trousers with a pair of sneakers. He is looking at left side.
Keywords: 'man', 'short', 'black', 'hair', 'suit', 'jacket', 'trousers', 'sneakers', 'looking', 'left'.
Output: [
- A man with short black hair is dressed in a suit jack...