SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
Pith reviewed 2026-05-10 08:30 UTC · model grok-4.3
The pith
SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
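For context, the 1k and 10k settings rank each query image against a pool of 1,000 or 10,000 candidate recipes and report the fraction of queries whose ground-truth recipe is retrieved at rank 1 (R@1). Below is a minimal sketch of how Recall@K is typically computed from paired embeddings; the function name and details are illustrative, not taken from the paper:

```python
import numpy as np

def recall_at_k(img_emb: np.ndarray, rec_emb: np.ndarray, ks=(1, 5, 10)):
    """Rank each query image against all candidate recipes by cosine
    similarity; the ground-truth recipe for query i sits at index i."""
    # L2-normalise so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    rec = rec_emb / np.linalg.norm(rec_emb, axis=1, keepdims=True)
    sims = img @ rec.T                      # (N, N) cosine similarities
    order = np.argsort(-sims, axis=1)       # candidate indices, best first
    # Position of the matching recipe in each query's ranking (0-based).
    ranks = np.argmax(order == np.arange(len(img))[:, None], axis=1)
    return {f"R@{k}": float((ranks < k).mean()) for k in ks}
```

In the standard Recipe1M protocol, such pools are sampled from the test set and the metrics averaged over several random draws; N would be 1,000 or 10,000 here.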
Load-bearing premise
That VLM2Vec with the authors' prompt templates and component-aware augmentation can reliably close the semantic gap between images and structured recipe text without the alignment mechanisms required by dual-encoder models, and that the reported gains will hold on data outside Recipe1M.
Original abstract
Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
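The abstract names two design elements without spelling them out: prompt templates over the recipe's title/ingredients/instructions structure, and component-aware augmentation that also trains on partial recipes. Here is a hedged sketch of what such a pipeline could look like; the template wording, function names, and drop probability are assumptions, not the paper's actual design:

```python
import random

def render_recipe_prompt(title, ingredients, instructions):
    # Illustrative template only; the paper's exact prompt wording is
    # not reproduced on this page.
    parts = [f"Title: {title}"]
    if ingredients:
        parts.append("Ingredients: " + "; ".join(ingredients))
    if instructions:
        parts.append("Instructions: " + " ".join(instructions))
    return "Represent this recipe for image retrieval.\n" + "\n".join(parts)

def component_aware_sample(title, ingredients, instructions, p_drop=0.3):
    # Hypothetical augmentation: randomly withhold the ingredients
    # and/or instructions so training also sees partial recipes
    # (the abstract's "complete and partial recipes").
    ing = ingredients if random.random() >= p_drop else []
    ins = instructions if random.random() >= p_drop else []
    return render_recipe_prompt(title, ing, ins)
```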
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that SIMMER, a single MLLM-based embedding model using VLM2Vec with custom prompt templates for recipe structures (title, ingredients, instructions) and component-aware data augmentation, achieves state-of-the-art cross-modal retrieval performance on the Recipe1M dataset. It reports specific improvements in image-to-recipe retrieval recall at rank 1 (R@1) of 5.7 percentage points in the 1k setting and 9 percentage points in the 10k setting over previous best methods, positioning the unified encoder as a replacement for dual-encoder architectures.
Significance. If the results are robust, this work is significant for demonstrating that off-the-shelf MLLM embedding models, enhanced with task-specific prompting and augmentation, can effectively address the semantic gap in image-recipe retrieval without the need for separate encoders and alignment losses. This could streamline future research in multimodal retrieval for structured domains and provide a baseline for MLLM applications in computer vision.
major comments (1)
- [Experiments] The SOTA claims rest entirely on evaluations using the Recipe1M dataset and its standard 1k/10k splits. Given that the prompt templates and augmentation strategy are explicitly designed around Recipe1M's title, ingredients, and instructions format, the central claim that this approach supersedes dual-encoder methods requires validation on at least one additional dataset to rule out dataset-specific effects.
minor comments (2)
- [Abstract] The abstract refers to 'our best model' without specifying the exact configuration or number of parameters; this should be made explicit.
- [Related Work] Ensure all baseline methods are cited with original references and that any ablation studies on prompt variants or augmentation components are clearly presented with quantitative results.
Circularity Check
No circularity; empirical engineering on VLM2Vec with no self-referential derivations
full rationale
The paper applies an existing MLLM (VLM2Vec) to image-recipe retrieval using custom prompt templates for recipe structure (title/ingredients/instructions) and a component-aware augmentation strategy that trains on complete/partial recipes. It reports empirical SOTA gains on standard Recipe1M 1k/10k splits (e.g., R@1 lifts from 81.8% to 87.5% and 56.5% to 65.5%). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the derivation; the method is a direct application plus task-specific engineering whose validity rests on external dataset results rather than reducing to its own inputs by construction. The central claim is therefore self-contained against benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multimodal LLMs can generate effective joint embeddings for images and structured text (title, ingredients, instructions) when given suitable prompts.
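If this axiom holds, both modalities can be embedded by a single model and trained contrastively on paired data. The page does not state SIMMER's training objective; a common choice in this line of work is a symmetric InfoNCE loss over in-batch negatives (cf. van den Oord et al. [44]), sketched here under that assumption with a placeholder temperature:

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, rec_emb: torch.Tensor, tau: float = 0.05):
    """Symmetric InfoNCE over in-batch negatives; matched image/recipe
    pairs share a row index. tau is an assumed value, not the paper's."""
    img = F.normalize(img_emb, dim=-1)
    rec = F.normalize(rec_emb, dim=-1)
    logits = img @ rec.T / tau              # (B, B) similarity logits
    labels = torch.arange(img.size(0), device=img.device)
    # Contrast both directions: image-to-recipe and recipe-to-image.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```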
Reference graph
Works this paper leans on
- [1] Marah Abdin, Jyoti Aneja, et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219, 2024.
- [2] Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, pages 35–44, 2018.
- [3] Jing-Jing Chen, Chong-Wah Ngo, Fu-Li Feng, and Tat-Seng Chua. Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. In Proceedings of ACM International Conference on Multimedia, pages 1020–1028, 2018.
- [4] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. In Proceedings of European Conference on Computer Vision, pages 104–120, 2020.
- [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- [6] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In Proceedings of British Machine Vision Conference, 2018.
- [7] Mikhail Fain, Niall Twomey, Andrey Ponikar, Ryan Fox, and Danushka Bollegala. Dividing and Conquering Cross-Modal Recipe Retrieval: From Nearest Neighbours Baselines to SoTA. arXiv preprint arXiv:1911.12763, 2019.
- [8] Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems, 2013.
- [9] Han Fu, Rui Wu, Chenghao Liu, and Jianling Sun. MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14558–14568, 2020.
- [10] Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. In Proceedings of Workshop on Representation Learning for NLP, pages 316–321, 2021.
- [11] Ricardo Guerrero, Hai X. Pham, and Vladimir Pavlovic. Cross-modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Subspace Learning. In Proceedings of ACM International Conference on Multimedia, pages 3192–3201, 2021.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [13] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
- [14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of International Conference on Learning Representations, 2022.
- [15] Xu Huang, Jin Liu, Zhizhong Zhang, and Yuan Xie. Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding. In Proceedings of ACM International Conference on Multimedia, pages 529–537, 2023.
- [16] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of International Conference on Machine Learning, pages 4904–4916, 2021.
- [17] Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks. In Proceedings of International Conference on Learning Representations, 2025.
- [18] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of International Conference on Learning Representations, 2015.
- [19]
- [20] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked Cross Attention for Image-Text Matching. In Proceedings of European Conference on Computer Vision, pages 201–216, 2018.
- [21] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. arXiv preprint arXiv:2407.07895, 2024.
- [22] Jiao Li, Jialiang Sun, Xing Xu, Wei Yu, and Fumin Shen. Cross-Modal Image-Recipe Retrieval via Intra- and Inter-Modality Hybrid Fusion. In Proceedings of ACM International Conference on Multimedia Retrieval, pages 173–182, 2021.
- [23]
- [24] Jiao Li, Xing Xu, Wei Yu, Fumin Shen, Zuo Cao, Kai Zuo, and Heng Tao Shen. Hybrid Fusion with Intra- and Cross-Modality Attention for Image-Recipe Retrieval. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, pages 244–254, 2021.
- [25] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of International Conference on Machine Learning, pages 12888–12900, 2022.
- [26] Lin Li, Ming Li, Zichen Zan, Qing Xie, and Jianquan Liu. Multi-subspace Implicit Alignment for Cross-modal Retrieval on Cooking Recipes and Food Images. In Proceedings of ACM International Conference on Information and Knowledge Management, pages 3211–3215, 2021.
- [27] Lijie Li, Caiyue Hu, Haitao Zhang, and Akshita Maradapu Vera Venkata Sai. Cross-modal Image-Recipe Retrieval via Multimodal Fusion. In Proceedings of ACM International Conference on Multimedia in Asia, pages 1–7, 2023.
- [28] Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs. In Proceedings of International Conference on Learning Representations, 2025.
- [29] Wenhao Liu, Simiao Yuan, Zhen Wang, Xinyi Chang, Limeng Gao, and Zhenrui Zhang. Revamping Image-Recipe Cross-Modal Retrieval with Dual Cross Attention Encoders. Mathematics, 12(20):3181, 2024.
- [30] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, 2019.
- [31] Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents. arXiv preprint arXiv:2507.04590, 2025.
- [32] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013.
- [33] Dim P. Papadopoulos, Enrique Mora, Nadiia Chepurko, Kuan Wei Huang, Ferda Ofli, and Antonio Torralba. Learning Program Representations for Food Images and Cooking Recipes. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16538–16548, 2022.
- [34] Hai X. Pham, Ricardo Guerrero, Vladimir Pavlovic, and Jiatong Li. CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval. In Proceedings of AAAI Conference on Artificial Intelligence, pages 2423–2430, 2021.
- [35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of International Conference on Machine Learning, pages 8748–8763, 2021.
- [36] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3068–3076, 2017.
- [37]
- [38] Amaia Salvador, Erhan Gundogdu, Loris Bazzani, and Michael Donoser. Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15470–15479, 2021.
- [39] Mustafa Shukor, Guillaume Couairon, Asya Grechka, and Matthieu Cord. Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4566–4577, 2022.
- [40] Mustafa Shukor, Nicolas Thome, and Matthieu Cord. Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval. Computer Vision and Image Understanding, 247(C), 2024.
- [41] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
- [42] Fangzhou Song, Bin Zhu, Yanbin Hao, and Shuo Wang. Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective. In Proceedings of European Conference on Computer Vision, pages 111–127, 2024.
- [43] Yu Sugiyama and Keiji Yanai. Cross-Modal Recipe Embeddings by Disentangling Recipe Contents and Dish Styles. In Proceedings of ACM International Conference on Multimedia, pages 2501–2509, 2021.
- [44] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.
- [45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, 2017.
- [46] Bhanu Prakash Voutharoja, Peng Wang, Lei Wang, and Vivienne Guan. MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval. arXiv preprint arXiv:2305.11327, 2023.
- [47] Muntasir Wahed, Xiaona Zhou, Tianjiao Yu, and Ismini Lourentzou. Fine-Grained Alignment for Cross-Modal Recipe Retrieval. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, pages 5572–5581, 2024.
- [48] Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-Peng Lim, and Steven C. H. Hoi. Learning Cross-Modal Embeddings With Adversarial Networks for Cooking Recipes and Food Images. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11564–11573, 2019.
- [49] Hao Wang, Guosheng Lin, Steven Hoi, and Chunyan Miao. Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval. In Proceedings of ACM International Conference on Multimedia, pages 5517–5526, 2022.
- [50] Hao Wang, Doyen Sahoo, Chenghao Liu, Ke Shu, Palakorn Achananuparp, Ee-Peng Lim, and Steven C. H. Hoi. Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes With Semantic Consistency and Attention Mechanism. IEEE Transactions on Multimedia, 24:2515–2525, 2022.
- [51] Hao Wang, Guosheng Lin, Steven C. H. Hoi, and Chunyan Miao. Learning Structural Representations for Recipe Generation and Food Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3363–3377, 2023.
- [52] Peng Wang, Shuai Bai, et al. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191, 2024.
- [53] Qing Wang, Chong-Wah Ngo, Yu Cao, and Ee-Peng Lim. Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval. In Proceedings of ACM International Conference on Multimedia, pages 6223–6231.
- [54] Zhongwei Xie, Ling Liu, Lin Li, and Luo Zhong. Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images. In Proceedings of ACM International Conference on Information and Knowledge Management, pages 2221–2230, 2021.
- [55] Zhongwei Xie, Ling Liu, Yanzhao Wu, Luo Zhong, and Lin Li. Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering. ACM Transactions on Information Systems, 40(4):74:1–74:27, 2021.
- [56] Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, and Ling Liu. Cross-Modal Retrieval between Event-Dense Text and Image. In Proceedings of ACM International Conference on Multimedia Retrieval, pages 229–238, 2022.
- [57] Zhongwei Xie, Ling Liu, Yanzhao Wu, Lin Li, and Luo Zhong. Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service. IEEE Transactions on Services Computing, 15(6):3304–3316, 2022.
- [58] Jing Yang, Junwen Chen, and Keiji Yanai. Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training. In Proceedings of International Conference on Multimedia Modeling, pages 471–482, 2023.
- [59] Jing Yang, Junwen Chen, and Keiji Yanai. Improving Cross-Modal Recipe Embeddings with Cross Decoder. In Proceedings of ACM Workshop on Intelligent Cross-Data Analysis and Retrieval, pages 1–4, 2024.
- [60] Jinghan Yang, Zhenbo Xu, Dehua Ma, Liu Liu, Fei Liu, Gong Huang, and Zhaofeng He. RecipeRAG: Advancing Recipe Generation with Reinforced Retrieval Augmented Generation. In Proceedings of ACM International Conference on Multimedia, pages 5060–5069, 2025.
- [61] Zichen Zan, Lin Li, Jianquan Liu, and Dong Zhou. Sentence-based and Noise-robust Cross-modal Retrieval on Cooking Recipes and Food Images. In Proceedings of ACM International Conference on Multimedia Retrieval, pages 117–125, 2020.
- [62] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In Proceedings of IEEE International Conference on Computer Vision, pages 11975–11986, 2023.
- [63] Bolin Zhang, Haruya Kyutoku, Keisuke Doman, Takahiro Komamizu, Ichiro Ide, and Jiangbo Qian. Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning. Knowledge-Based Systems, 305:112641, 2024.
- [64] Fan Zhao, Yuqing Lu, Zhuo Yao, and Fangying Qu. Cross modal recipe retrieval with fine grained modal interaction. Scientific Reports, 15(1):4842, 2025.
- [65] Wenyu Zhao, Dong Zhou, Buqing Cao, Wei Liang, and Nitin Sukhija. Exploring latent weight factors and global information for food-oriented cross-modal retrieval. Connection Science, 35(1):2233714, 2023.
- [66] Wenyu Zhao, Dong Zhou, Buqing Cao, Kai Zhang, and Jinjun Chen. Efficient low-rank multi-component fusion with component-specific factors in image-recipe retrieval. Multimedia Tools and Applications, 83(2):3601–3619, 2024.
- [67] Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Yanbin Hao. R2GAN: Cross-Modal Recipe Retrieval With Generative Adversarial Network. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11469–11478, 2019.
- [68] Zhuoyang Zou, Xinghui Zhu, Qinying Zhu, Yi Liu, and Lei Zhu. CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access, 12:33283–33295, 2024.
- [69] Zhuoyang Zou, Xinghui Zhu, Qinying Zhu, Hongyan Zhang, and Lei Zhu. Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval. Foods, 13(11):1628, 2024.
discussion (0)