SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
Pith reviewed 2026-05-10 08:30 UTC · model grok-4.3
The pith
SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
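For context, the 1k and 10k settings rank each query image against a pool of 1,000 or 10,000 candidate recipes and report the fraction of queries whose ground-truth recipe is retrieved at rank 1 (R@1). Below is a minimal sketch of how Recall@K is typically computed from paired embeddings; the function name and details are illustrative, not taken from the paper:

```python
import numpy as np

def recall_at_k(img_emb: np.ndarray, rec_emb: np.ndarray, ks=(1, 5, 10)):
    """Rank each query image against all candidate recipes by cosine
    similarity; the ground-truth recipe for query i sits at index i."""
    # L2-normalise so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    rec = rec_emb / np.linalg.norm(rec_emb, axis=1, keepdims=True)
    sims = img @ rec.T                      # (N, N) cosine similarities
    order = np.argsort(-sims, axis=1)       # candidate indices, best first
    # Position of the matching recipe in each query's ranking (0-based).
    ranks = np.argmax(order == np.arange(len(img))[:, None], axis=1)
    return {f"R@{k}": float((ranks < k).mean()) for k in ks}
```

In the standard Recipe1M protocol, such pools are sampled from the test set and the metrics averaged over several random draws; N would be 1,000 or 10,000 here.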
Load-bearing premise
That VLM2Vec with the authors' prompt templates and component-aware augmentation can reliably close the semantic gap between images and structured recipe text without the alignment mechanisms required by dual-encoder models, and that the reported gains will hold on data outside Recipe1M.
Original abstract
Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
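The abstract names two design elements without spelling them out: prompt templates over the recipe's title/ingredients/instructions structure, and component-aware augmentation that also trains on partial recipes. Here is a hedged sketch of what such a pipeline could look like; the template wording, function names, and drop probability are assumptions, not the paper's actual design:

```python
import random

def render_recipe_prompt(title, ingredients, instructions):
    # Illustrative template only; the paper's exact prompt wording is
    # not reproduced on this page.
    parts = [f"Title: {title}"]
    if ingredients:
        parts.append("Ingredients: " + "; ".join(ingredients))
    if instructions:
        parts.append("Instructions: " + " ".join(instructions))
    return "Represent this recipe for image retrieval.\n" + "\n".join(parts)

def component_aware_sample(title, ingredients, instructions, p_drop=0.3):
    # Hypothetical augmentation: randomly withhold the ingredients
    # and/or instructions so training also sees partial recipes
    # (the abstract's "complete and partial recipes").
    ing = ingredients if random.random() >= p_drop else []
    ins = instructions if random.random() >= p_drop else []
    return render_recipe_prompt(title, ing, ins)
```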
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that SIMMER, a single MLLM-based embedding model using VLM2Vec with custom prompt templates for recipe structures (title, ingredients, instructions) and component-aware data augmentation, achieves state-of-the-art cross-modal retrieval performance on the Recipe1M dataset. It reports specific improvements in image-to-recipe retrieval recall at rank 1 (R@1) of 5.7 percentage points in the 1k setting and 9 percentage points in the 10k setting over previous best methods, positioning the unified encoder as a replacement for dual-encoder architectures.
Significance. If the results are robust, this work is significant for demonstrating that off-the-shelf MLLM embedding models, enhanced with task-specific prompting and augmentation, can effectively address the semantic gap in image-recipe retrieval without the need for separate encoders and alignment losses. This could streamline future research in multimodal retrieval for structured domains and provide a baseline for MLLM applications in computer vision.
major comments (1)
- [Experiments] The SOTA claims rest entirely on evaluations using the Recipe1M dataset and its standard 1k/10k splits. Given that the prompt templates and augmentation strategy are explicitly designed around Recipe1M's title, ingredients, and instructions format, the central claim that this approach supersedes dual-encoder methods requires validation on at least one additional dataset to rule out dataset-specific effects.
minor comments (2)
- [Abstract] The abstract refers to 'our best model' without specifying the exact configuration or number of parameters; this should be made explicit.
- [Related Work] Ensure all baseline methods are cited with original references and that any ablation studies on prompt variants or augmentation components are clearly presented with quantitative results.
Circularity Check
No circularity; empirical engineering on VLM2Vec with no self-referential derivations
full rationale
The paper applies an existing MLLM (VLM2Vec) to image-recipe retrieval using custom prompt templates for recipe structure (title/ingredients/instructions) and a component-aware augmentation strategy that trains on complete/partial recipes. It reports empirical SOTA gains on standard Recipe1M 1k/10k splits (e.g., R@1 lifts from 81.8% to 87.5% and 56.5% to 65.5%). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the derivation; the method is a direct application plus task-specific engineering whose validity rests on external dataset results rather than reducing to its own inputs by construction. The central claim is therefore self-contained against benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multimodal LLMs can generate effective joint embeddings for images and structured text (title, ingredients, instructions) when given suitable prompts.
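If this axiom holds, both modalities can be embedded by a single model and trained contrastively on paired data. The page does not state SIMMER's training objective; a common choice in this line of work is a symmetric InfoNCE loss over in-batch negatives (cf. van den Oord et al. [44]), sketched here under that assumption with a placeholder temperature:

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, rec_emb: torch.Tensor, tau: float = 0.05):
    """Symmetric InfoNCE over in-batch negatives; matched image/recipe
    pairs share a row index. tau is an assumed value, not the paper's."""
    img = F.normalize(img_emb, dim=-1)
    rec = F.normalize(rec_emb, dim=-1)
    logits = img @ rec.T / tau              # (B, B) similarity logits
    labels = torch.arange(img.size(0), device=img.device)
    # Contrast both directions: image-to-recipe and recipe-to-image.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```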
Reference graph
Works this paper leans on
- [1] Marah Abdin, Jyoti Aneja, et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219, 2024.
- [2] Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, pages 35–44, 2018.
- [3] Jing-Jing Chen, Chong-Wah Ngo, Fu-Li Feng, and Tat-Seng Chua. Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. In Proceedings of ACM International Conference on Multimedia, pages 1020–1028, 2018.
- [4] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. In Proceedings of European Conference on Computer Vision, pages 104–120, 2020.
- [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- [6] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In Proceedings of British Machine Vision Conference, 2018.
- [7] Mikhail Fain, Niall Twomey, Andrey Ponikar, Ryan Fox, and Danushka Bollegala. Dividing and Conquering Cross-Modal Recipe Retrieval: From Nearest Neighbours Baselines to SoTA. arXiv preprint arXiv:1911.12763, 2019.
- [8] Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems, 2013.
- [9] Han Fu, Rui Wu, Chenghao Liu, and Jianling Sun. MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14558–14568, 2020.
- [10] Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. In Proceedings of Workshop on Representation Learning for NLP, pages 316–321, 2021.
- [11] Ricardo Guerrero, Hai X. Pham, and Vladimir Pavlovic. Cross-modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Subspace Learning. In Proceedings of ACM International Conference on Multimedia, pages 3192–3201, 2021.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [13] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
- [14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of International Conference on Learning Representations, 2022.
- [15] Xu Huang, Jin Liu, Zhizhong Zhang, and Yuan Xie. Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding. In Proceedings of ACM International Conference on Multimedia, pages 529–537, 2023.
- [16] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of International Conference on Machine Learning, pages 4904–4916, 2021.
- [17] Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks. In Proceedings of International Conference on Learning Representations, 2025.
- [18] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of International Conference on Learning Representations, 2015.
- [19]
- [20] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked Cross Attention for Image-Text Matching. In Proceedings of European Conference on Computer Vision, pages 201–216, 2018.
- [21] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. arXiv preprint arXiv:2407.07895, 2024.
- [22] Jiao Li, Jialiang Sun, Xing Xu, Wei Yu, and Fumin Shen. Cross-Modal Image-Recipe Retrieval via Intra- and Inter-Modality Hybrid Fusion. In Proceedings of ACM International Conference on Multimedia Retrieval, pages 173–182, 2021.
- [23]
- [24] Jiao Li, Xing Xu, Wei Yu, Fumin Shen, Zuo Cao, Kai Zuo, and Heng Tao Shen. Hybrid Fusion with Intra- and Cross-Modality Attention for Image-Recipe Retrieval. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, pages 244–254, 2021.
- [25] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of International Conference on Machine Learning, pages 12888–12900, 2022.
- [26] Lin Li, Ming Li, Zichen Zan, Qing Xie, and Jianquan Liu. Multi-subspace Implicit Alignment for Cross-modal Retrieval on Cooking Recipes and Food Images. In Proceedings of ACM International Conference on Information and Knowledge Management, pages 3211–3215, 2021.
- [27] Lijie Li, Caiyue Hu, Haitao Zhang, and Akshita Maradapu Vera Venkata Sai. Cross-modal Image-Recipe Retrieval via Multimodal Fusion. In Proceedings of ACM International Conference on Multimedia in Asia, pages 1–7, 2023.
- [28] Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs. In Proceedings of International Conference on Learning Representations, 2025.
- [29] Wenhao Liu, Simiao Yuan, Zhen Wang, Xinyi Chang, Limeng Gao, and Zhenrui Zhang. Revamping Image-Recipe Cross-Modal Retrieval with Dual Cross Attention Encoders. Mathematics, 12(20):3181, 2024.
- [30] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, 2019.
- [31] Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents. arXiv preprint arXiv:2507.04590, 2025.
- [32] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013.
- [33] Dim P. Papadopoulos, Enrique Mora, Nadiia Chepurko, Kuan Wei Huang, Ferda Ofli, and Antonio Torralba. Learning Program Representations for Food Images and Cooking Recipes. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16538–16548, 2022.
- [34] Hai X. Pham, Ricardo Guerrero, Vladimir Pavlovic, and Jiatong Li. CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval. In Proceedings of AAAI Conference on Artificial Intelligence, pages 2423–2430, 2021.
- [35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of International Conference on Machine Learning, pages 8748–8763, 2021.
- [36] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3068–3076, 2017.
- [37]
- [38] Amaia Salvador, Erhan Gundogdu, Loris Bazzani, and Michael Donoser. Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15470–15479, 2021.
- [39] Mustafa Shukor, Guillaume Couairon, Asya Grechka, and Matthieu Cord. Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4566–4577, 2022.
- [40] Mustafa Shukor, Nicolas Thome, and Matthieu Cord. Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval. Computer Vision and Image Understanding, 247(C), 2024.
- [41] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
- [42] Fangzhou Song, Bin Zhu, Yanbin Hao, and Shuo Wang. Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective. In Proceedings of European Conference on Computer Vision, pages 111–127, 2024.
- [43] Yu Sugiyama and Keiji Yanai. Cross-Modal Recipe Embeddings by Disentangling Recipe Contents and Dish Styles. In Proceedings of ACM International Conference on Multimedia, pages 2501–2509, 2021.
- [44] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.
- [45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, 2017.
- [46] Bhanu Prakash Voutharoja, Peng Wang, Lei Wang, and Vivienne Guan. MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval. arXiv preprint arXiv:2305.11327, 2023.
- [47] Muntasir Wahed, Xiaona Zhou, Tianjiao Yu, and Ismini Lourentzou. Fine-Grained Alignment for Cross-Modal Recipe Retrieval. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, pages 5572–5581, 2024.
- [48] Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-Peng Lim, and Steven C. H. Hoi. Learning Cross-Modal Embeddings With Adversarial Networks for Cooking Recipes and Food Images. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11564–11573, 2019.
- [49] Hao Wang, Guosheng Lin, Steven Hoi, and Chunyan Miao. Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval. In Proceedings of ACM International Conference on Multimedia, pages 5517–5526, 2022.
- [50] Hao Wang, Doyen Sahoo, Chenghao Liu, Ke Shu, Palakorn Achananuparp, Ee-Peng Lim, and Steven C. H. Hoi. Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes With Semantic Consistency and Attention Mechanism. IEEE Transactions on Multimedia, 24:2515–2525, 2022.
- [51] Hao Wang, Guosheng Lin, Steven C. H. Hoi, and Chunyan Miao. Learning Structural Representations for Recipe Generation and Food Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3363–3377, 2023.
- [52] Peng Wang, Shuai Bai, et al. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191, 2024.
- [53] Qing Wang, Chong-Wah Ngo, Yu Cao, and Ee-Peng Lim. Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval. In Proceedings of ACM International Conference on Multimedia, pages 6223–6231.
- [54] Zhongwei Xie, Ling Liu, Lin Li, and Luo Zhong. Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images. In Proceedings of ACM International Conference on Information and Knowledge Management, pages 2221–2230, 2021.
- [55] Zhongwei Xie, Ling Liu, Yanzhao Wu, Luo Zhong, and Lin Li. Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering. ACM Transactions on Information Systems, 40(4):74:1–74:27, 2021.
- [56] Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, and Ling Liu. Cross-Modal Retrieval between Event-Dense Text and Image. In Proceedings of ACM International Conference on Multimedia Retrieval, pages 229–238, 2022.
- [57] Zhongwei Xie, Ling Liu, Yanzhao Wu, Lin Li, and Luo Zhong. Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service. IEEE Transactions on Services Computing, 15(6):3304–3316, 2022.
- [58] Jing Yang, Junwen Chen, and Keiji Yanai. Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training. In Proceedings of International Conference on Multimedia Modeling, pages 471–482, 2023.
- [59] Jing Yang, Junwen Chen, and Keiji Yanai. Improving Cross-Modal Recipe Embeddings with Cross Decoder. In Proceedings of ACM Workshop on Intelligent Cross-Data Analysis and Retrieval, pages 1–4, 2024.
- [60] Jinghan Yang, Zhenbo Xu, Dehua Ma, Liu Liu, Fei Liu, Gong Huang, and Zhaofeng He. RecipeRAG: Advancing Recipe Generation with Reinforced Retrieval Augmented Generation. In Proceedings of ACM International Conference on Multimedia, pages 5060–5069, 2025.
- [61] Zichen Zan, Lin Li, Jianquan Liu, and Dong Zhou. Sentence-based and Noise-robust Cross-modal Retrieval on Cooking Recipes and Food Images. In Proceedings of ACM International Conference on Multimedia Retrieval, pages 117–125, 2020.
- [62] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In Proceedings of IEEE International Conference on Computer Vision, pages 11975–11986, 2023.
- [63] Bolin Zhang, Haruya Kyutoku, Keisuke Doman, Takahiro Komamizu, Ichiro Ide, and Jiangbo Qian. Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning. Knowledge-Based Systems, 305:112641, 2024.
- [64] Fan Zhao, Yuqing Lu, Zhuo Yao, and Fangying Qu. Cross modal recipe retrieval with fine grained modal interaction. Scientific Reports, 15(1):4842, 2025.
- [65] Wenyu Zhao, Dong Zhou, Buqing Cao, Wei Liang, and Nitin Sukhija. Exploring latent weight factors and global information for food-oriented cross-modal retrieval. Connection Science, 35(1):2233714, 2023.
- [66] Wenyu Zhao, Dong Zhou, Buqing Cao, Kai Zhang, and Jinjun Chen. Efficient low-rank multi-component fusion with component-specific factors in image-recipe retrieval. Multimedia Tools and Applications, 83(2):3601–3619, 2024.
- [67] Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Yanbin Hao. R2GAN: Cross-Modal Recipe Retrieval With Generative Adversarial Network. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11469–11478, 2019.
- [68] Zhuoyang Zou, Xinghui Zhu, Qinying Zhu, Yi Liu, and Lei Zhu. CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access, 12:33283–33295, 2024.
- [69] Zhuoyang Zou, Xinghui Zhu, Qinying Zhu, Hongyan Zhang, and Lei Zhu. Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval. Foods, 13(11):1628, 2024.
discussion (0)