The 2nd EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval

Alexandros Karatzoglou; Ioannis Arapakis; Joemon M. Jose; Junchen Fu; Qian Li; Qijiong Liu; Xin Xin; Xi Wang; Xuri Ge

arXiv: 2605.26941 · v1 · pith:F65PK4MX · submitted 2026-05-26 · cs.IR · cs.MM

Junchen Fu, Xuri Ge, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Xi Wang, Qijiong Liu, Qian Li, Joemon M. Jose

Pith reviewed 2026-06-29 15:47 UTC · model grok-4.3

classification 💻 cs.IR cs.MM

keywords multimodal representation learning · information retrieval · efficiency bottlenecks · foundation models · workshops · benchmarks · metrics

The pith

Massive parameter counts in pretrained multimodal models create major efficiency bottlenecks when adapting their representations for IR tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper states that large foundation models such as Qwen, LLaVA, and CLIP achieve strong results on multimodal information retrieval tasks including web search, cross-modal retrieval, and recommender systems. Their size nevertheless imposes major costs during training, deployment, and inference that limit real-world use. The authors respond by proposing the second EReL@MIR workshop to convene researchers and produce new solutions, efficiency metrics, and benchmarks.

Core claim

The authors claim that the massive parameter counts of pretrained multimodal foundation models generate efficiency bottlenecks in training, deployment, and inference that hinder practical representation learning for information retrieval, and that a dedicated workshop is required to surface solutions and define appropriate metrics and benchmarks.

What carries the argument

The efficiency bottlenecks arising from high parameter counts when adapting pretrained multimodal models to IR tasks.

If this is right

New efficiency metrics tailored to multimodal IR will be defined and adopted.
Benchmarks will be established to evaluate representation learning methods under parameter constraints.
Open challenges in adapting foundation models for IR will be prioritized by the community.
Practical deployment of multimodal models in search and recommendation systems will become more feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar workshops may be needed in adjacent areas where large models face deployment constraints.
Success here could shift research incentives toward parameter-efficient architectures in retrieval.
Without progress on these bottlenecks, commercial IR systems may continue to rely on smaller, less capable models.

Load-bearing premise

That convening researchers at the workshop will produce actionable solutions, new metrics, or benchmarks capable of overcoming the efficiency bottlenecks.

What would settle it

No new efficiency metrics, benchmarks, or concrete adaptation methods for multimodal IR are produced or adopted after the workshop takes place.

Original abstract

Multimodal representation learning has attracted increasing attention in AI, driven by the strong performance of large, pretrained multimodal foundation models such as Qwen, LLaVA, and CLIP. These models deliver impressive performance on a range of multimodal information retrieval (MIR) tasks, including web search, cross-modal retrieval, and recommender systems. Yet their massive parameter counts create major efficiency bottlenecks when adapting their representations for IR tasks during training, deployment, and inference. These limitations hinder the practical use of foundation models for representation learning in information retrieval. To address these issues, we propose organizing the EReL@MIR workshop at MM 2026, bringing together researchers from academia and industry to discuss emerging solutions, open challenges, and new efficiency metrics and benchmarks for multimodal IR representation learning in the foundation-model era. The workshop's official website is available at https://erel-mir.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a workshop call that restates known efficiency problems with large multimodal models in IR but adds no new methods, data, or results.

This document is the call for the second EReL@MIR workshop at MM 2026. It notes that models like CLIP, LLaVA, and Qwen deliver strong results on multimodal retrieval tasks but their size creates real bottlenecks during training, deployment, and inference. The proposed response is to bring researchers together to discuss solutions, challenges, and new benchmarks.

The announcement does a straightforward job of naming the practical issue. Efficiency concerns with foundation models in information retrieval are genuine and widely felt, so the problem description lands cleanly.

Beyond that, the text contains no technical contribution. There are no proposed techniques, no experiments, no new metrics, and no analysis of what the first workshop produced. The hope that the event will generate actionable outputs is stated without any supporting plan or evidence.

The soft spots are exactly what you would expect from a workshop proposal rather than a research paper: minimal citations, no derivations, and no engagement with specific prior solutions. These are not hidden flaws; they follow from the document's purpose.

This is for people already working in multimodal IR who might attend the workshop or submit position papers. A reader looking for new methods, reproducible findings, or benchmarks will not find them here.

I would not send this to peer review. It is an organizational announcement, not a manuscript that needs technical refereeing.

Referee Report

0 major / 0 minor

Summary. The manuscript announces the 2nd EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval, to be held at MM 2026. It motivates the event by noting that large pretrained multimodal models (e.g., Qwen, LLaVA, CLIP) deliver strong performance on MIR tasks such as web search and recommender systems but suffer from efficiency bottlenecks due to massive parameter counts during training, deployment, and inference. The workshop is proposed to bring together researchers to discuss emerging solutions, open challenges, and new efficiency metrics and benchmarks.

Significance. If the workshop occurs and generates new metrics, benchmarks, or collaborative solutions, it could have modest community value in highlighting efficiency concerns within multimodal IR. The manuscript itself, however, contains no original derivations, experiments, data, or technical contributions, so its significance as a research article is negligible.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the central concern regarding the manuscript's nature and significance below.

Referee: The manuscript itself, however, contains no original derivations, experiments, data, or technical contributions, so its significance as a research article is negligible. REFEREE RECOMMENDATION: reject

Authors: We agree that the manuscript contains no original technical contributions, derivations, or experiments; it is explicitly a workshop proposal for the 2nd EReL@MIR event at MM 2026. Its purpose is to announce the workshop and motivate discussion on efficiency bottlenecks in large multimodal models for MIR tasks. If the target venue accepts workshop proposals, we believe the community value in highlighting these issues and fostering collaboration justifies consideration, even without new technical results. revision: no

Circularity Check

0 steps flagged

No significant circularity

The document is a workshop announcement proposing an event to discuss efficiency issues in multimodal IR models. It contains no derivations, equations, predictions, fitted parameters, or technical claims that could form a derivation chain. Its sole purpose is a motivational description of the bottlenecks and of the workshop itself, with no load-bearing steps that reduce to inputs by construction or self-citation. The central statement about organizing the workshop does not rely on any self-referential logic or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical claims, derivations, or measurements are present, so the ledger is empty.

