FAST-MEL: A Fast, Accurate, and Storage Efficient Solution for Multimodal Entity Linking
Pith reviewed 2026-06-27 08:34 UTC · model grok-4.3
The pith
FAST-MEL uses a compact fixed-size vector for each entity's text and images to link multimodal mentions at the accuracy of top systems while running three orders of magnitude faster and using one order of magnitude less storage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FAST-MEL is a lightweight encoder-based MEL solution that relies on a novel and compact fixed-size vectorized representation of both the textual and visual information of each entity or mention. It matches the accuracy of the best systems but performs three orders of magnitude faster. It also consumes one order of magnitude less storage than the fastest systems.
What carries the argument
A novel compact fixed-size vectorized representation that jointly encodes textual and visual information for each entity or mention.
If this is right
- MEL systems can now operate on much larger knowledge bases without prohibitive compute or memory costs.
- Real-time multimodal applications such as social media monitoring or visual search become practical at scale.
- The same compact representation can serve both entity linking and downstream tasks that reuse the index.
- Storage savings allow deployment on resource-constrained devices or cloud instances with smaller memory footprints.
Where Pith is reading between the lines
- If the vector representation generalizes, similar compact encodings could improve efficiency in other multimodal retrieval tasks such as image captioning or cross-modal search.
- The speed gains suggest the method could support continuous updating of knowledge bases in streaming data scenarios.
- Because storage is reduced, organizations with very large entity catalogs could maintain multiple versions of the index for different domains.
Load-bearing premise
That a compact fixed-size vector can retain enough detail from both text and images to keep linking accuracy high while delivering the claimed speed and storage savings.
What would settle it
Running the system on a large multimodal test set and observing either a drop below the accuracy of the best prior systems or failure to achieve three orders of magnitude speedup and one order of magnitude storage reduction.
Figures
read the original abstract
Multimodal entity linking (MEL) is the task that consists of matching textual and visual mentions of entities in unstructured data to their corresponding entities in a knowledge base (KB). To be effective in large-scale practical settings, MEL systems must meet three objectives: high linking accuracy, computational efficiency, and storage efficiency, i.e., a compact yet efficient index of the KB. In this paper, we highlight that state-of-the-art systems fail to simultaneously satisfy these 3 requirements. To meet this three-fold objective, we propose FAST-MEL, a lightweight encoder-based MEL solution that relies on a novel and compact fixed-size vectorized representation of both the textual and visual information of each entity or mention. It matches the accuracy of the best systems but performs three orders of magnitude faster. It also consumes one order of magnitude less storage than the fastest systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FAST-MEL, a lightweight encoder-based solution for multimodal entity linking (MEL) that relies on a novel compact fixed-size vectorized representation of textual and visual information for entities and mentions. It claims this approach simultaneously achieves high linking accuracy comparable to state-of-the-art systems, three orders of magnitude faster inference, and one order of magnitude lower storage than the fastest existing systems, addressing the failure of prior MEL systems to meet all three requirements at once.
Significance. If the performance claims hold under rigorous evaluation, the result would be significant for practical large-scale MEL deployments, as it targets the joint optimization of accuracy, speed, and storage that current systems do not achieve. The introduction of a compact multimodal representation could influence indexing strategies in knowledge-base applications.
major comments (1)
- [Abstract] Abstract: the central claims of matching SOTA accuracy, 1000x speedup, and 10x storage reduction are asserted without any experimental results, tables, figures, baselines, error bars, or methodological details. The load-bearing assumption that the novel fixed-size vectorized representation preserves sufficient multimodal information for accurate linking therefore cannot be evaluated from the provided material.
Simulated Author's Rebuttal
We thank the referee for their review. The comment focuses on the abstract's presentation of results. We address this below and note that the full manuscript contains the supporting experimental details.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claims of matching SOTA accuracy, 1000x speedup, and 10x storage reduction are asserted without any experimental results, tables, figures, baselines, error bars, or methodological details. The load-bearing assumption that the novel fixed-size vectorized representation preserves sufficient multimodal information for accurate linking therefore cannot be evaluated from the provided material.
Authors: The abstract is intended as a concise summary of the paper's main contributions and findings. The full manuscript provides the requested details: Section 3 describes the compact fixed-size vectorized representation and its construction from textual and visual features; Section 4 presents the experimental evaluation, including direct comparisons to state-of-the-art baselines with accuracy metrics, inference latency measurements (demonstrating three orders of magnitude speedup), storage footprint analysis (one order of magnitude reduction), and supporting tables and figures. Error bars and statistical details are included where appropriate in the results. These experiments directly validate that the representation preserves sufficient multimodal information for linking accuracy comparable to prior systems. revision: no
Circularity Check
No significant circularity
full rationale
The provided abstract and context contain no equations, derivations, or load-bearing self-citations. The central claim is an empirical proposal of a compact vectorized representation for MEL that achieves stated efficiency/accuracy tradeoffs; no step reduces by construction to fitted inputs, self-definitions, or author-prior ansatzes. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
invented entities (1)
-
compact fixed-size vectorized representation
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Zhiwei Hu, Víctor Gutiérrez-Basulto, Ru Li, and Jeff Z. Pan. 2025. Multi-level Matching Network for Multimodal Entity Linking. InProceedings of the 31st ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD’25). Association for Computing Machinery, New York, NY, 508–519. doi:10.1145/ 3690624.3709306
arXiv 2025
-
[2]
Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, and Jeff Z. Pan. 2025. Multi-level Mixture of Experts for Multimodal Entity Linking. InProceedings of the 31st ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD’25), Vol. 2. Association for Computing Machinery, New York, NY, 979–990. doi:10.1145/3711896.3737060
-
[3]
Juyeon Kim, Geon Lee, Taeuk Kim, and Kijung Shin. 2025. KGMEL: Knowledge Graph-enhanced Multimodal Entity Linking. InProceedings of the 48th interna- tional ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2025). Association for Computing Machinery, New York, NY, 3015–3019. doi:10.1145/3726302.3730217
-
[4]
Qi Liu, Yongyi He, Tong Xu, Defu Lian, Che Liu, Zhi Zheng, and Enhong Chen
-
[5]
Hique: Hierarchical question embedding network for multimodal depression detection,
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models. InProceedings of the 33rd ACM international Conference on Information and Knowledge Management (CIKM’24). Association for Computing Machinery, New York, NY, 1909–1919. doi:10.1145/3627673.3679793
-
[6]
Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, and Jingping Liu. 2025. I2CR: Intra- and Inter-modal Col- laborative Reflections for Multimodal Entity Linking. InProceedings of the 33rd ACM international conference on Multimedia (MM’25). Association for Computing Machinery, New York, NY, 4942–4951. doi:10.1145/37...
-
[7]
Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. Entity-based Knowledge Conflicts in Question An- swering. InProceedings of the 2021 conference on Empirical Methods in Natural Language Processing (EMNLP 2021). Association for Computational Linguistics, 7052–7063. doi:10.18653/v1/2021.emnlp-main.565
-
[8]
Pengfei Luo, Tong Xu, Che Liu, Suojuan Zhang, Linli Xu, Minglei Li, and Enhong Chen. 2024. Bridging Gaps in Content and Knowledge for Multimodal Entity Linking. InProceedings of the 32nd ACM international conference on Multimedia (MM’24). Association for Computing Machinery, New York, NY, 9311–9320. doi:10. 1145/3664647.3681661
arXiv 2024
-
[9]
Pengfei Luo, Tong Xu, Shiwei Wu, Chen Zhu, Linli Xu, and Enhong Chen. 2023. Multi-grained Multimodal Interaction Network for Entity Linking. InProceedings of the 29th ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD’23). Association for Computing Machinery, New York, NY, 1583–1594. doi:10.1145/3580305.3599439
-
[10]
Edgar Meij, Krisztian Balog, and Daan Odijk. 2014. Entity Linking and Retrieval for Semantic Search. InProceedings of the 7th ACM international conference on Web Search and Data Mining (WSDM’14). Association for Computing Machinery, New York, NY, 683–684. doi:10.1145/2556195.2556201
-
[11]
Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal Named Entity Disambiguation for Noisy Social Media Posts. InProceedings of the 56th annual meeting of the Association for Computational Linguistics (ACL 2018). Association for Computational Linguistics, 2000–2008. doi:10.18653/v1/P18-1186
-
[12]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models from Natural Language Supervision. InProceedings of the 38th International Conference on Machine Learning (ICML 2021), Vol. 139. ...
2021
-
[13]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 conference on Em- pirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019). Association for Computational Linguistics, 3982–3992. doi:10.18653/v...
-
[14]
Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford
Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. InProceedings of the 3rd Text Retrieval Conference (TREC-3). Department of Commerce, NIST, 109–126. https://trec.nist. gov/pubs/trec3/papers/city.ps.gz
1994
-
[15]
Senbao Shi, Zhenran Xu, Baotian Hu, and Min Zhang. 2024. Generative Multi- modal Entity Linking. InProceedings of the 2024 joint international conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, 7654–7665. https://aclanthology.org/2024.lrec-main.676/
2024
-
[16]
Xuhui Sui, Ying Zhang, Yu Zhao, Kehui Song, Baohang Zhou, and Xiaojie Yuan
-
[17]
InProceedings of Findings of the Association for Computational Linguistics (ACL 2024)
MELOV: Multimodal Entity Linking with Optimized Visual Features in Latent Space. InProceedings of Findings of the Association for Computational Linguistics (ACL 2024). Association for Computational Linguistics, 816–826. doi:10. 18653/v1/2024.findings-acl.46
2024
-
[18]
Representation Learning with Contrastive Predictive Coding
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018.Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 doi:10.48550/arXiv.1807. 03748
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1807 2018
-
[19]
Peng Wang, Jiangheng Wu, and Xiaohang Chen. 2022. Multimodal Entity Link- ing with Gated Hierarchical Fusion and Contrastive Training. InProceedings of the 45th international ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2022). Association for Computing Machinery, New York, NY, 938–948. doi:10.1145/3477495.3531867
-
[20]
Xuwu Wang, Junfeng Tian, Min Gui, Zhixu Li, Rui Wang, Ming Yan, Lihan Chen, and Yanghua Xiao. 2022. WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types. InProceedings of the 60th annual meeting of the Association for Computational Linguistics (ACL 2022). Association for Computational Linguistics, 4785–4797....
-
[21]
Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang
-
[22]
InProceedings of the 57th annual meeting of the Association for Computational Linguistics (ACL 2019)
Improving Question Answering over Incomplete KBs with Knowledge- Aware Reader. InProceedings of the 57th annual meeting of the Association for Computational Linguistics (ACL 2019). Association for Computational Linguistics, 4258–4264. doi:10.18653/v1/P19-1417
-
[23]
Zefeng Zhang, Jiawei Sheng, Chuang Zhang, Liangyunzhi Liangyunzhi, Wenyuan Zhang, Siqi Wang, and Tingwen Liu. 2024. Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking. InProceedings of Findings of the Asso- ciation for Computational Linguistics (ACL 2024). Association for Computational Linguistics, 4103–4117. doi:10.18653/v1/202...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.