FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

Haokun Wen; Weili Guan; Xiangyu Zhao; Xiaolin Chen; Xinghao Xie; Xuemeng Song

arxiv: 2605.22552 · v1 · pith:PZHAAFNBnew · submitted 2026-05-21 · 💻 cs.CV · cs.MM

FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

Haokun Wen , Xuemeng Song , Xinghao Xie , Xiaolin Chen , Xiangyu Zhao , Weili Guan This is my paper

Pith reviewed 2026-05-22 05:58 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords fashion image retrievalmultimodal large language modelstask-adaptive learningunified retrieval frameworkquery calibrationadaptive samplingbenchmark dataset

0 comments

The pith

FashionLens unifies fashion image retrieval across diverse tasks using adaptive query calibration and sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified model for handling many different kinds of fashion image searches, such as different query types and user intentions, which current systems handle separately. It creates a new combined dataset called U-FIRE to test this and adds two special test sets for checking how well the model works on new tasks. The key innovations are a method to adjust the query representation to fit the specific task using spherical adjustments and a way to sample training data that accounts for how hard each task is and how much data it has. This matters because it could let online stores use one system for all kinds of product searches instead of many specialized ones.

Core claim

FashionLens is proposed as a unified framework based on Multimodal Large Language Models for versatile fashion image retrieval. It incorporates a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation to handle divergent matching objectives, and a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior to mitigate optimization imbalance. Experiments demonstrate state-of-the-art performance across diverse retrieval scenarios on the U-FIRE benchmark and robust generalization to unseen tasks.

What carries the argument

The Proposal-Guided Spherical Query Calibrator, which uses adaptive spherical linear interpolation to align query representations with task-specific metric spaces, and the Gradient-Guided Adaptive Sampling, which re-weights tasks according to learning difficulty and data scale.

If this is right

Achieves state-of-the-art results on a wide range of fashion retrieval tasks in the U-FIRE benchmark.
Generalizes effectively to retrieval tasks not seen during training.
Supports multiple query formats and search intentions within a single model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The adaptive calibration technique might help in other areas where different tasks have conflicting goals, such as general visual search.
Creating a unified benchmark like U-FIRE could push the field toward more general-purpose retrieval systems rather than task-specific ones.
Real-world e-commerce platforms could reduce the number of separate models needed for different search features.

Load-bearing premise

The two proposed components, the query calibrator and the adaptive sampling strategy, can successfully resolve differences in matching goals and training imbalances between tasks.

What would settle it

A direct comparison showing that FashionLens does not outperform previous methods on multiple retrieval scenarios in U-FIRE or fails to perform well on the held-out generalization datasets would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.22552 by Haokun Wen, Weili Guan, Xiangyu Zhao, Xiaolin Chen, Xinghao Xie, Xuemeng Song.

**Figure 2.** Figure 2: Examples of the two unseen evaluation tasks. Blue text denotes the [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of FashionLens, which leverages MLLMs as the backbone. It includes two key novel components: 1) Proposal-Guided Spherical Query [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 5.** Figure 5: Visualization of task difficulty measured by gradient norms during [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper consolidates fashion datasets into U-FIRE and adds two adaptive modules to an MLLM for multi-task retrieval, but the modules' effectiveness stays unverified without ablations or numbers.

read the letter

Hi, the core of this paper is a new unified benchmark U-FIRE that pulls together existing fashion datasets plus two fresh test sets, paired with a multimodal LLM framework that uses a Proposal-Guided Spherical Query Calibrator and Gradient-Guided Adaptive Sampling to handle different query types and task difficulties in one model. That setup directly targets the practical mess of e-commerce search where users mix text, images, and attributes. Consolidating the data and releasing code plus data counts as useful groundwork, and the spherical interpolation plus gradient-based reweighting are concrete attempts to fix mismatched objectives and scale imbalances rather than just fine-tuning harder. Those choices show some thought about how MLLMs actually behave across tasks. The soft spot is the evidence. The abstract claims SOTA results and strong generalization on U-FIRE, yet no metrics, baselines, or component ablations appear in the provided material, which leaves the stress-test concern intact: we cannot tell whether the calibrator and sampler are doing the claimed work or whether gains come from other factors. If the full paper contains detailed tables and error analysis, that would change the picture; right now the central claims rest on unshown experiments. This work is aimed at retrieval researchers and e-commerce teams who need one system instead of separate models for each intent. Readers working on adaptive multimodal training might pick up design ideas even if they adapt the pieces. It deserves a serious referee because the problem is real and the framing is reasonable, though any review would need to press hard on the missing quantitative support before acceptance.

Referee Report

2 major / 2 minor

Summary. The paper introduces U-FIRE, a consolidated benchmark merging fragmented fashion retrieval datasets plus two new manually curated sets for generalization testing. It proposes FashionLens, an MLLM-based unified framework that adds a Proposal-Guided Spherical Query Calibrator (adaptive spherical linear interpolation to shift queries into task-aligned spaces) and a Gradient-Guided Adaptive Sampling strategy (re-weighting tasks by realtime difficulty and data-scale prior) to handle divergent objectives and optimization imbalance. Experiments are claimed to show SOTA performance across diverse scenarios on U-FIRE together with robust generalization to unseen tasks.

Significance. If the empirical results hold, the work would be significant for moving fashion retrieval from narrow task-specific models toward a single versatile framework, which is practically relevant for e-commerce. The creation of U-FIRE as a unified benchmark is a concrete community resource. The two proposed modules constitute a plausible attempt to address multi-task optimization challenges that are common in retrieval settings.

major comments (2)

[Experiments] Experiments section: the central SOTA and generalization claims rest on the two new modules, yet no ablation numbers, per-component metric deltas, baseline comparisons, or analysis under varying task complexities and data scales are supplied. This leaves the load-bearing assertion that the Proposal-Guided Spherical Query Calibrator and Gradient-Guided Adaptive Sampling resolve divergent matching objectives and optimization imbalance as an unverified assumption rather than a demonstrated result.
[§3.2] §3.2 (Proposal-Guided Spherical Query Calibrator): the description of adaptive spherical linear interpolation is given, but without quantitative evidence showing improved task alignment or retrieval metrics when the module is ablated, it is impossible to confirm that the mechanism actually mitigates the stated objective mismatch.

minor comments (2)

[§3.2] Clarify whether the spherical interpolation is performed in the original embedding space or a projected space, and provide the exact interpolation formula.
[§4] Add a table or figure that explicitly lists the constituent datasets of U-FIRE together with their sizes and task types.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and agree that additional experimental validation is needed to strengthen the claims regarding our proposed modules.

read point-by-point responses

Referee: [Experiments] Experiments section: the central SOTA and generalization claims rest on the two new modules, yet no ablation numbers, per-component metric deltas, baseline comparisons, or analysis under varying task complexities and data scales are supplied. This leaves the load-bearing assertion that the Proposal-Guided Spherical Query Calibrator and Gradient-Guided Adaptive Sampling resolve divergent matching objectives and optimization imbalance as an unverified assumption rather than a demonstrated result.

Authors: We acknowledge that the current experiments section emphasizes overall SOTA performance on U-FIRE without providing detailed component-wise ablations, metric deltas, or analyses across task complexities and data scales. This is a valid point, and the manuscript would benefit from such evidence. In the revised version, we will add comprehensive ablation studies, including per-component contributions with quantitative deltas, baseline comparisons, and evaluations under varying task complexities and data scales to demonstrate how the Proposal-Guided Spherical Query Calibrator and Gradient-Guided Adaptive Sampling address divergent objectives and optimization imbalance. revision: yes
Referee: [§3.2] §3.2 (Proposal-Guided Spherical Query Calibrator): the description of adaptive spherical linear interpolation is given, but without quantitative evidence showing improved task alignment or retrieval metrics when the module is ablated, it is impossible to confirm that the mechanism actually mitigates the stated objective mismatch.

Authors: We agree that quantitative ablation evidence is required to substantiate the role of the Proposal-Guided Spherical Query Calibrator. The current manuscript describes the adaptive spherical linear interpolation but lacks the corresponding ablation metrics. We will incorporate ablation experiments in the revision that report improvements in task alignment and retrieval metrics when the module is included versus ablated, thereby confirming its contribution to mitigating objective mismatch. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces U-FIRE as a consolidated benchmark and proposes FashionLens as a new MLLM-based framework with two explicitly designed modules (Proposal-Guided Spherical Query Calibrator using adaptive spherical linear interpolation, and Gradient-Guided Adaptive Sampling for re-weighting). Performance claims rest on experimental results across retrieval scenarios and generalization tests rather than any mathematical derivation that reduces to fitted inputs or self-referential definitions. No equations or steps in the provided text exhibit self-definitional loops, fitted parameters called predictions, or load-bearing self-citations that force the central claims by construction. The framework builds on standard multimodal models with independent new components, making the derivation self-contained and empirically grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard assumptions from multimodal learning and retrieval literature with no explicitly listed free parameters, invented entities, or non-standard axioms in the abstract.

axioms (1)

domain assumption Multimodal Large Language Models provide a suitable base for fine-tuning on retrieval tasks with divergent objectives.
The framework is built directly on MLLMs as stated in the abstract.

pith-pipeline@v0.9.0 · 5774 in / 1254 out tokens · 43451 ms · 2026-05-22T05:58:24.415328+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

[1]

Neural compat- ibility modeling with attentive knowledge distillation,

X. Song, F. Feng, X. Han, X. Yang, W. Liu, and L. Nie, “Neural compat- ibility modeling with attentive knowledge distillation,” inProceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2018, pp. 5–14

work page 2018
[2]

Category-aware mul- timodal attention network for fashion compatibility modeling,

P. Jing, K. Cui, W. Guan, L. Nie, and Y . Su, “Category-aware mul- timodal attention network for fashion compatibility modeling,”IEEE Transactions on Multimedia, vol. 25, pp. 9120–9131, 2023

work page 2023
[3]

Multi-modal and multi-domain embedding learning for fashion retrieval and analysis,

X. Gu, Y . Wong, L. Shou, P. Peng, G. Chen, and M. S. Kankanhalli, “Multi-modal and multi-domain embedding learning for fashion retrieval and analysis,”IEEE Transactions on Multimedia, vol. 21, no. 6, pp. 1524–1537, 2019

work page 2019
[4]

A multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and expo- nential moving average distillation,

L. Xiao and T. Yamasaki, “A multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and expo- nential moving average distillation,”IEEE Transactions on Multimedia, pp. 1–10, 2026

work page 2026
[5]

Sketch me that shoe,

Q. Yu, F. Liu, Y . Song, T. Xiang, T. M. Hospedales, and C. C. Loy, “Sketch me that shoe,” inIEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016, pp. 799–807

work page 2016
[6]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,

Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” inIEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016, pp. 1096–1104

work page 2016
[7]

Movingfashion: a benchmark for the video-to-shop challenge,

M. Godi, C. Joppi, G. Skenderi, and M. Cristani, “Movingfashion: a benchmark for the video-to-shop challenge,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. IEEE, 2022, pp. 517–525

work page 2022
[8]

Fashionbert: Text and image matching with adaptive loss for cross- modal retrieval,

D. Gao, L. Jin, B. Chen, M. Qiu, P. Li, Y . Wei, Y . Hu, and H. Wang, “Fashionbert: Text and image matching with adaptive loss for cross- modal retrieval,” inProceedings of the International ACM SIGIR Con- ference on Research and Development in Information Retrieval. ACM, 2020, pp. 2251–2260

work page 2020
[9]

Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks,

X. Han, X. Zhu, L. Yu, L. Zhang, Y . Song, and T. Xiang, “Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2023, pp. 2669–2680

work page 2023
[10]

Fash- ionsap: Symbols and attributes prompt for fine-grained fashion vision- language pre-training,

Y . Han, L. Zhang, Q. Chen, Z. Chen, Z. Li, J. Yang, and Z. Cao, “Fash- ionsap: Symbols and attributes prompt for fine-grained fashion vision- language pre-training,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2023, pp. 15 028– 15 038

work page 2023
[11]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inProceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

work page 2021
[12]

BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. C. H. Hoi, “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation,” inProceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 12 888–12 900

work page 2022
[13]

Uniir: Training and benchmarking universal multimodal information retrievers,

C. Wei, Y . Chen, H. Chen, H. Hu, G. Zhang, J. Fu, A. Ritter, and W. Chen, “Uniir: Training and benchmarking universal multimodal information retrievers,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 387–404

work page 2024
[14]

Vlm2vec: Training vision-language models for massive multimodal embedding tasks,

Z. Jiang, R. Meng, X. Yang, S. Yavuz, Y . Zhou, and W. Chen, “Vlm2vec: Training vision-language models for massive multimodal embedding tasks,” inProceedings of the International Conference on Learning Representations. OpenReview.net, 2025

work page 2025
[15]

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

R. Meng, Z. Jiang, Y . Liu, M. Su, X. Yang, Y . Fu, C. Qin, Z. Chen, R. Xu, C. Xiong, Y . Zhou, W. Chen, and S. Yavuz, “Vlm2vec- v2: Advancing multimodal embedding for videos, images, and visual documents,”CoRR, vol. abs/2507.04590, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Loss-balanced task weighting to reduce negative transfer in multi-task learning,

S. Liu, Y . Liang, and A. Gitter, “Loss-balanced task weighting to reduce negative transfer in multi-task learning,” inProceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, 2019, pp. 9977– 9978

work page 2019
[17]

Feature decomposition for reducing negative transfer: A novel multi-task learning method for recommender system (student abstract),

J. Zhou, Q. Yu, C. Luo, and J. Zhang, “Feature decomposition for reducing negative transfer: A novel multi-task learning method for recommender system (student abstract),” inProceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, 2023, pp. 16 390– 16 391

work page 2023
[18]

Dual alignment-enhanced fashion vision-language pre-training,

W. Guan, K. Wang, X. Song, K. Zhang, X. Chang, and S. Zhang, “Dual alignment-enhanced fashion vision-language pre-training,”ACM Trans- actions on Multimedia Computing, Communications and Applications, just Accepted

work page
[19]

Fashion-Gen: The Generative Fashion Dataset and Challenge

N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y . Zhang, C. Jauvin, and C. Pal, “Fashion-gen: The generative fashion dataset and challenge,”CoRR, vol. abs/1806.08317, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Fashion IQ: A new dataset towards retrieving images by natural language feedback,

H. Wu, Y . Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, and R. Feris, “Fashion IQ: A new dataset towards retrieving images by natural language feedback,” inIEEE Conference on Computer Vision and Pattern Recognition. Computer Vision Foundation / IEEE, 2021, pp. 11 307–11 317

work page 2021
[21]

Bridging modalities: Improving universal multi- modal retrieval by multimodal large language models,

X. Zhang, Y . Zhang, W. Xie, M. Li, Z. Dai, D. Long, P. Xie, M. Zhang, W. Li, and M. Zhang, “Bridging modalities: Improving universal multi- modal retrieval by multimodal large language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2025, pp. 9274–9285

work page 2025
[22]

Mm- embed: Universal multimodal retrieval with multimodal LLMS,

S. Lin, C. Lee, M. Shoeybi, J. Lin, B. Catanzaro, and W. Ping, “Mm- embed: Universal multimodal retrieval with multimodal LLMS,” inPro- ceedings of the International Conference on Learning Representations. OpenReview.net, 2025

work page 2025
[23]

Simple but effective raw-data level multimodal fusion for composed image retrieval,

H. Wen, X. Song, X. Chen, Y . Wei, L. Nie, and T. Chua, “Simple but effective raw-data level multimodal fusion for composed image retrieval,” inProceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2024, pp. 229–239

work page 2024
[24]

Modeling instant user intent and content-level transition for sequential fashion recommendation,

Y . Ding, Y . Ma, W. K. Wong, and T. Chua, “Modeling instant user intent and content-level transition for sequential fashion recommendation,” IEEE Transactions on Multimedia, vol. 24, pp. 2687–2700, 2022

work page 2022
[25]

Learning fashion compatibility with context conditioning embedding,

Z. Lu, Y . Hu, C. Yu, Y . Chen, and B. Zeng, “Learning fashion compatibility with context conditioning embedding,”IEEE Transactions on Multimedia, vol. 25, pp. 5516–5526, 2023

work page 2023
[26]

Dialog-based interactive image retrieval,

X. Guo, H. Wu, Y . Cheng, S. Rennie, G. Tesauro, and R. S. Feris, “Dialog-based interactive image retrieval,” inAdvances in Neural Infor- mation Processing Systems, 2018, pp. 676–686

work page 2018
[27]

HAIFIT: human-centered AI for fashion image translation,

J. Jiang, X. Li, W. Yu, and D. Wu, “HAIFIT: human-centered AI for fashion image translation,”CoRR, vol. abs/2403.08651, 2024

work page arXiv 2024
[28]

Simple yet efficient: Towards self-supervised FG-SBIR with unified sample feature alignment,

J. Jiang, D. Wu, Z. Jiang, and W. Yu, “Simple yet efficient: Towards self-supervised FG-SBIR with unified sample feature alignment,”CoRR, vol. abs/2406.11551, 2024

work page arXiv 2024
[29]

Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images,

Y . Ge, R. Zhang, X. Wang, X. Tang, and P. Luo, “Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Computer Vision Foundation / IEEE, 2019, pp. 5337–5345

work page 2019
[30]

Composite sketch+text queries for retrieving objects with elusive names and com- plex interactions,

P. Gatti, K. Parikh, D. P. Paul, M. Gupta, and A. Mishra, “Composite sketch+text queries for retrieving objects with elusive names and com- plex interactions,” inAAAI Conference on Artificial Intelligence. AAAI Press, 2024, pp. 1869–1877

work page 2024
[31]

Automatic spatially-aware fashion concept discovery,

X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y . Li, Y . Zhao, and L. S. Davis, “Automatic spatially-aware fashion concept discovery,” in IEEE International Conference on Computer Vision. IEEE Computer Society, 2017, pp. 1472–1480

work page 2017
[32]

Neurostylist: Neural compatibility modeling for clothing matching,

X. Song, F. Feng, J. Liu, Z. Li, L. Nie, and J. Ma, “Neurostylist: Neural compatibility modeling for clothing matching,” inProceedings of the ACM on Multimedia Conference. ACM, 2017, pp. 753–761

work page 2017
[33]

Learning fashion compatibil- ity with bidirectional lstms,

X. Han, Z. Wu, Y . Jiang, and L. S. Davis, “Learning fashion compatibil- ity with bidirectional lstms,” inProceedings of the ACM on Multimedia Conference. ACM, 2017, pp. 1078–1086

work page 2017
[34]

Fashionai: A hierarchical dataset for fashion understanding,

X. Zou, X. Kong, W. Wong, C. Wang, Y . Liu, and Y . Cao, “Fashionai: A hierarchical dataset for fashion understanding,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Computer Vision Foundation / IEEE, 2019, pp. 296–304

work page 2019
[35]

Cross-domain image retrieval with a dual attribute-aware ranking network,

J. Huang, R. S. Feris, Q. Chen, and S. Yan, “Cross-domain image retrieval with a dual attribute-aware ranking network,” inProceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2015, pp. 1062–1070

work page 2015
[36]

Qwen3-VL Technical Report

Q. Team, “Qwen3-vl technical report,”CoRR, vol. abs/2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Lora: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” inPro- ceedings of the International Conference on Learning Representations. OpenReview.net, 2022

work page 2022
[38]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inProceedings of the International Conference on Learning Represen- tations. OpenReview.net, 2019

work page 2019
[39]

Scaling deep contrastive learning batch size under memory limited setup,

L. Gao, Y . Zhang, J. Han, and J. Callan, “Scaling deep contrastive learning batch size under memory limited setup,” inProceedings of the Workshop on Representation Learning for NLP. Association for Computational Linguistics, 2021, pp. 316–321

work page 2021

[1] [1]

Neural compat- ibility modeling with attentive knowledge distillation,

X. Song, F. Feng, X. Han, X. Yang, W. Liu, and L. Nie, “Neural compat- ibility modeling with attentive knowledge distillation,” inProceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2018, pp. 5–14

work page 2018

[2] [2]

Category-aware mul- timodal attention network for fashion compatibility modeling,

P. Jing, K. Cui, W. Guan, L. Nie, and Y . Su, “Category-aware mul- timodal attention network for fashion compatibility modeling,”IEEE Transactions on Multimedia, vol. 25, pp. 9120–9131, 2023

work page 2023

[3] [3]

Multi-modal and multi-domain embedding learning for fashion retrieval and analysis,

X. Gu, Y . Wong, L. Shou, P. Peng, G. Chen, and M. S. Kankanhalli, “Multi-modal and multi-domain embedding learning for fashion retrieval and analysis,”IEEE Transactions on Multimedia, vol. 21, no. 6, pp. 1524–1537, 2019

work page 2019

[4] [4]

A multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and expo- nential moving average distillation,

L. Xiao and T. Yamasaki, “A multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and expo- nential moving average distillation,”IEEE Transactions on Multimedia, pp. 1–10, 2026

work page 2026

[5] [5]

Sketch me that shoe,

Q. Yu, F. Liu, Y . Song, T. Xiang, T. M. Hospedales, and C. C. Loy, “Sketch me that shoe,” inIEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016, pp. 799–807

work page 2016

[6] [6]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,

Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” inIEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016, pp. 1096–1104

work page 2016

[7] [7]

Movingfashion: a benchmark for the video-to-shop challenge,

M. Godi, C. Joppi, G. Skenderi, and M. Cristani, “Movingfashion: a benchmark for the video-to-shop challenge,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. IEEE, 2022, pp. 517–525

work page 2022

[8] [8]

Fashionbert: Text and image matching with adaptive loss for cross- modal retrieval,

D. Gao, L. Jin, B. Chen, M. Qiu, P. Li, Y . Wei, Y . Hu, and H. Wang, “Fashionbert: Text and image matching with adaptive loss for cross- modal retrieval,” inProceedings of the International ACM SIGIR Con- ference on Research and Development in Information Retrieval. ACM, 2020, pp. 2251–2260

work page 2020

[9] [9]

Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks,

X. Han, X. Zhu, L. Yu, L. Zhang, Y . Song, and T. Xiang, “Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2023, pp. 2669–2680

work page 2023

[10] [10]

Fash- ionsap: Symbols and attributes prompt for fine-grained fashion vision- language pre-training,

Y . Han, L. Zhang, Q. Chen, Z. Chen, Z. Li, J. Yang, and Z. Cao, “Fash- ionsap: Symbols and attributes prompt for fine-grained fashion vision- language pre-training,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2023, pp. 15 028– 15 038

work page 2023

[11] [11]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inProceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

work page 2021

[12] [12]

BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. C. H. Hoi, “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation,” inProceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 12 888–12 900

work page 2022

[13] [13]

Uniir: Training and benchmarking universal multimodal information retrievers,

C. Wei, Y . Chen, H. Chen, H. Hu, G. Zhang, J. Fu, A. Ritter, and W. Chen, “Uniir: Training and benchmarking universal multimodal information retrievers,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 387–404

work page 2024

[14] [14]

Vlm2vec: Training vision-language models for massive multimodal embedding tasks,

Z. Jiang, R. Meng, X. Yang, S. Yavuz, Y . Zhou, and W. Chen, “Vlm2vec: Training vision-language models for massive multimodal embedding tasks,” inProceedings of the International Conference on Learning Representations. OpenReview.net, 2025

work page 2025

[15] [15]

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

R. Meng, Z. Jiang, Y . Liu, M. Su, X. Yang, Y . Fu, C. Qin, Z. Chen, R. Xu, C. Xiong, Y . Zhou, W. Chen, and S. Yavuz, “Vlm2vec- v2: Advancing multimodal embedding for videos, images, and visual documents,”CoRR, vol. abs/2507.04590, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Loss-balanced task weighting to reduce negative transfer in multi-task learning,

S. Liu, Y . Liang, and A. Gitter, “Loss-balanced task weighting to reduce negative transfer in multi-task learning,” inProceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, 2019, pp. 9977– 9978

work page 2019

[17] [17]

Feature decomposition for reducing negative transfer: A novel multi-task learning method for recommender system (student abstract),

J. Zhou, Q. Yu, C. Luo, and J. Zhang, “Feature decomposition for reducing negative transfer: A novel multi-task learning method for recommender system (student abstract),” inProceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, 2023, pp. 16 390– 16 391

work page 2023

[18] [18]

Dual alignment-enhanced fashion vision-language pre-training,

W. Guan, K. Wang, X. Song, K. Zhang, X. Chang, and S. Zhang, “Dual alignment-enhanced fashion vision-language pre-training,”ACM Trans- actions on Multimedia Computing, Communications and Applications, just Accepted

work page

[19] [19]

Fashion-Gen: The Generative Fashion Dataset and Challenge

N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y . Zhang, C. Jauvin, and C. Pal, “Fashion-gen: The generative fashion dataset and challenge,”CoRR, vol. abs/1806.08317, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

Fashion IQ: A new dataset towards retrieving images by natural language feedback,

H. Wu, Y . Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, and R. Feris, “Fashion IQ: A new dataset towards retrieving images by natural language feedback,” inIEEE Conference on Computer Vision and Pattern Recognition. Computer Vision Foundation / IEEE, 2021, pp. 11 307–11 317

work page 2021

[21] [21]

Bridging modalities: Improving universal multi- modal retrieval by multimodal large language models,

X. Zhang, Y . Zhang, W. Xie, M. Li, Z. Dai, D. Long, P. Xie, M. Zhang, W. Li, and M. Zhang, “Bridging modalities: Improving universal multi- modal retrieval by multimodal large language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2025, pp. 9274–9285

work page 2025

[22] [22]

Mm- embed: Universal multimodal retrieval with multimodal LLMS,

S. Lin, C. Lee, M. Shoeybi, J. Lin, B. Catanzaro, and W. Ping, “Mm- embed: Universal multimodal retrieval with multimodal LLMS,” inPro- ceedings of the International Conference on Learning Representations. OpenReview.net, 2025

work page 2025

[23] [23]

Simple but effective raw-data level multimodal fusion for composed image retrieval,

H. Wen, X. Song, X. Chen, Y . Wei, L. Nie, and T. Chua, “Simple but effective raw-data level multimodal fusion for composed image retrieval,” inProceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2024, pp. 229–239

work page 2024

[24] [24]

Modeling instant user intent and content-level transition for sequential fashion recommendation,

Y . Ding, Y . Ma, W. K. Wong, and T. Chua, “Modeling instant user intent and content-level transition for sequential fashion recommendation,” IEEE Transactions on Multimedia, vol. 24, pp. 2687–2700, 2022

work page 2022

[25] [25]

Learning fashion compatibility with context conditioning embedding,

Z. Lu, Y . Hu, C. Yu, Y . Chen, and B. Zeng, “Learning fashion compatibility with context conditioning embedding,”IEEE Transactions on Multimedia, vol. 25, pp. 5516–5526, 2023

work page 2023

[26] [26]

Dialog-based interactive image retrieval,

X. Guo, H. Wu, Y . Cheng, S. Rennie, G. Tesauro, and R. S. Feris, “Dialog-based interactive image retrieval,” inAdvances in Neural Infor- mation Processing Systems, 2018, pp. 676–686

work page 2018

[27] [27]

HAIFIT: human-centered AI for fashion image translation,

J. Jiang, X. Li, W. Yu, and D. Wu, “HAIFIT: human-centered AI for fashion image translation,”CoRR, vol. abs/2403.08651, 2024

work page arXiv 2024

[28] [28]

Simple yet efficient: Towards self-supervised FG-SBIR with unified sample feature alignment,

J. Jiang, D. Wu, Z. Jiang, and W. Yu, “Simple yet efficient: Towards self-supervised FG-SBIR with unified sample feature alignment,”CoRR, vol. abs/2406.11551, 2024

work page arXiv 2024

[29] [29]

Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images,

Y . Ge, R. Zhang, X. Wang, X. Tang, and P. Luo, “Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Computer Vision Foundation / IEEE, 2019, pp. 5337–5345

work page 2019

[30] [30]

Composite sketch+text queries for retrieving objects with elusive names and com- plex interactions,

P. Gatti, K. Parikh, D. P. Paul, M. Gupta, and A. Mishra, “Composite sketch+text queries for retrieving objects with elusive names and com- plex interactions,” inAAAI Conference on Artificial Intelligence. AAAI Press, 2024, pp. 1869–1877

work page 2024

[31] [31]

Automatic spatially-aware fashion concept discovery,

X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y . Li, Y . Zhao, and L. S. Davis, “Automatic spatially-aware fashion concept discovery,” in IEEE International Conference on Computer Vision. IEEE Computer Society, 2017, pp. 1472–1480

work page 2017

[32] [32]

Neurostylist: Neural compatibility modeling for clothing matching,

X. Song, F. Feng, J. Liu, Z. Li, L. Nie, and J. Ma, “Neurostylist: Neural compatibility modeling for clothing matching,” inProceedings of the ACM on Multimedia Conference. ACM, 2017, pp. 753–761

work page 2017

[33] [33]

Learning fashion compatibil- ity with bidirectional lstms,

X. Han, Z. Wu, Y . Jiang, and L. S. Davis, “Learning fashion compatibil- ity with bidirectional lstms,” inProceedings of the ACM on Multimedia Conference. ACM, 2017, pp. 1078–1086

work page 2017

[34] [34]

Fashionai: A hierarchical dataset for fashion understanding,

X. Zou, X. Kong, W. Wong, C. Wang, Y . Liu, and Y . Cao, “Fashionai: A hierarchical dataset for fashion understanding,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Computer Vision Foundation / IEEE, 2019, pp. 296–304

work page 2019

[35] [35]

Cross-domain image retrieval with a dual attribute-aware ranking network,

J. Huang, R. S. Feris, Q. Chen, and S. Yan, “Cross-domain image retrieval with a dual attribute-aware ranking network,” inProceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2015, pp. 1062–1070

work page 2015

[36] [36]

Qwen3-VL Technical Report

Q. Team, “Qwen3-vl technical report,”CoRR, vol. abs/2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Lora: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” inPro- ceedings of the International Conference on Learning Representations. OpenReview.net, 2022

work page 2022

[38] [38]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inProceedings of the International Conference on Learning Represen- tations. OpenReview.net, 2019

work page 2019

[39] [39]

Scaling deep contrastive learning batch size under memory limited setup,

L. Gao, Y . Zhang, J. Han, and J. Callan, “Scaling deep contrastive learning batch size under memory limited setup,” inProceedings of the Workshop on Representation Learning for NLP. Association for Computational Linguistics, 2021, pp. 316–321

work page 2021