pith. machine review for the scientific record.

arxiv: 2604.11913 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords nutrition estimation · egocentric cooking videos · dish-level analysis · video feature fusion · keyframe selection · food image recognition · macronutrient prediction · multimodal visual reasoning

The pith

Cooking process keyframes from egocentric videos supply complementary nutritional details that final dish images alone obscure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that nutrition estimation from completed meal photos is limited because oils, sauces, and mixed components lose visual clarity after cooking. It tests whether selecting frames from the cooking sequence, specifically moments of ingredient addition, and fusing their features with those of the final frame can recover some of that lost information. The staged model starts with backbones pretrained on large food datasets, adds a lightweight fusion step, and uses an event detector to choose the relevant keyframes. Experiments on an annotated subset of the HD-EPIC dataset show measurable gains under controlled settings, and the gains depend on both the strength of the visual backbone and the accuracy of the keyframe selection.

Core claim

V-Nutri combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted by a VideoMamba-based event detector targeting ingredient-addition moments; on the newly annotated HD-EPIC benchmark this yields improved calorie and macronutrient estimates compared with final-frame-only baselines.
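To make the selection step concrete, here is a minimal sketch of event-driven keyframe picking: a per-frame ingredient-addition score, as a VideoMamba-style detector might emit, is thinned with a score threshold and a minimum temporal gap. The function, threshold, and gap are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_keyframes(frame_scores, top_k=4, min_gap=30, threshold=0.5):
    """Pick up to top_k frame indices with the highest ingredient-addition
    scores, enforcing a minimum gap so keyframes do not cluster.
    frame_scores: per-frame event probabilities from a detector
    (illustrative interface; the paper's VideoMamba detector may differ)."""
    order = np.argsort(frame_scores)[::-1]          # highest score first
    chosen = []
    for idx in order:
        if frame_scores[idx] < threshold:           # ignore non-event frames
            break
        if all(abs(int(idx) - c) >= min_gap for c in chosen):
            chosen.append(int(idx))
        if len(chosen) == top_k:
            break
    return sorted(chosen)

# Toy example: scores peak around two ingredient-addition moments.
scores = np.zeros(300)
scores[40:45] = 0.9    # first addition event
scores[200:205] = 0.8  # second addition event
print(select_keyframes(scores))  # -> [44, 204]
```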

What carries the argument

The lightweight fusion module that combines visual features from the final dish frame with those from cooking keyframes selected by a VideoMamba-based event detector targeting ingredient-addition moments.
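The paper describes the fusion module only as lightweight, so the following is one plausible realization rather than the authors' architecture: the final-dish embedding queries the keyframe embeddings via attention, and the pooled result is concatenated with the dish embedding before a small regression head. The dimensions, head count, and four-output layout (calories plus three macronutrients) are assumptions.

```python
import torch
import torch.nn as nn

class LightweightFusion(nn.Module):
    """Fuse a final-dish embedding with N process-keyframe embeddings and
    regress calories plus three macronutrients. A sketch of one plausible
    'lightweight' design (attention pooling + MLP head)."""

    def __init__(self, dim: int = 768, n_outputs: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, n_outputs)
        )

    def forward(self, final_feat: torch.Tensor, keyframe_feats: torch.Tensor):
        # final_feat: (B, dim); keyframe_feats: (B, N, dim)
        query = final_feat.unsqueeze(1)                  # dish frame as the query
        pooled, _ = self.attn(query, keyframe_feats, keyframe_feats)
        fused = torch.cat([final_feat, pooled.squeeze(1)], dim=-1)
        return self.head(fused)                          # (B, n_outputs)

# Toy usage with backbone features taken as given:
B, N, D = 2, 4, 768
model = LightweightFusion(dim=D)
print(model(torch.randn(B, D), torch.randn(B, N, D)).shape)  # torch.Size([2, 4])
```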

If this is right

  • Process cues from the cooking sequence can supplement final-dish visuals for more accurate calorie and macronutrient estimates.
  • The size of the improvement scales with the capacity of the visual backbone and the precision of the ingredient-addition detector.
  • A public benchmark now exists for evaluating video-based dish nutrition methods on the annotated HD-EPIC data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same staged fusion idea could be tested on longer, unscripted home videos where ingredient order is less predictable.
  • If keyframe selection improves, the approach might reduce the need for perfectly lit final-dish photos in wearable dietary trackers.
  • The dependence on backbone capacity suggests that stronger general video models could widen the benefit without changing the fusion design.

Load-bearing premise

That the keyframes chosen by the event detector contain nutritional information not visible in the final dish, and that this information can be fused without adding noise.

What would settle it

Running the same fusion architecture on the HD-EPIC test set while replacing the selected process keyframes with random or empty frames, and observing no accuracy gain (or a drop), would falsify the claim of complementary evidence.
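That control is cheap to express once features are precomputed; a minimal sketch, assuming per-dish tensors of keyframe features (the paper does not specify this interface, and the mode names are illustrative):

```python
import torch

def ablate_keyframes(keyframe_feats: torch.Tensor, mode: str) -> torch.Tensor:
    """Swap the selected process-keyframe features to probe whether they
    carry complementary evidence. 'selected' keeps them as-is, 'random'
    substitutes matched-shape noise, 'empty' zeroes them out."""
    if mode == "selected":
        return keyframe_feats
    if mode == "random":
        return torch.randn_like(keyframe_feats)
    if mode == "empty":
        return torch.zeros_like(keyframe_feats)
    raise ValueError(f"unknown mode: {mode}")

# If the error with 'selected' frames is not lower than with 'random' or
# 'empty' frames under the same fusion model, the claim of complementary
# process evidence does not hold on this benchmark.
feats = torch.randn(2, 4, 768)
for mode in ("selected", "random", "empty"):
    print(mode, ablate_keyframes(feats, mode).shape)
```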

Figures

Figures reproduced from arXiv: 2604.11913 by Chengkun Yue, Chuanzhi Xu, Jiangpeng He.

Figure 1. Traditional nutrition estimation is process-blind, relying …
Figure 2. Overview of V-Nutri (inference): given an egocentric cooking video, the process keyframes and final dish frame are selected …
Figure 3. Nutrition distribution of 52 benchmark instances. Violin …
Figure 4. Predicted vs. ground-truth nutrient values for the ViT …
Original abstract

Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the final completed dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframe selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event-detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces V-Nutri, a staged framework for dish-level nutrition estimation from egocentric cooking videos. It further annotates the HD-EPIC dataset to establish the first benchmark for this task and combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking-process keyframes. Keyframes are selected by a VideoMamba-based event detector targeting ingredient-addition moments. Experiments on HD-EPIC show that process cues supply complementary nutritional evidence and improve estimation under controlled conditions, with gains depending on backbone capacity and event-detection quality. Code and annotations are released.

Significance. If the reported gains hold, the work advances visual nutrition estimation by demonstrating that temporal cooking-process information can resolve ambiguities (e.g., oils, sauces) that persist in single final-dish images. The explicit release of the annotated HD-EPIC benchmark and reproducible code is a clear strength that supports further research in computational health applications.

minor comments (2)
  1. Abstract: the claim that process cues 'improve nutrition estimation under controlled conditions' is presented without any numerical results, baselines, or definition of the controlled conditions; adding one or two key quantitative findings would make the abstract a more informative summary of the central result.
  2. The manuscript would benefit from a brief explicit statement of the precise evaluation metrics (e.g., MAE or percentage error on calories and macronutrients) and the exact train/test splits used on HD-EPIC, even if they appear in the tables; a sketch of such metrics follows.
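For reference, the metrics the referee asks about are standard regression errors; a minimal sketch, assuming predictions and ground truth as arrays in the same units, with an illustrative column order not taken from the paper:

```python
import numpy as np

def nutrition_errors(pred: np.ndarray, true: np.ndarray, eps: float = 1e-8):
    """Per-nutrient mean absolute error and mean absolute percentage error.
    pred, true: (n_dishes, 4) arrays; column order (calories, fat, carbs,
    protein) is illustrative, not taken from the paper."""
    mae = np.abs(pred - true).mean(axis=0)
    mape = 100.0 * (np.abs(pred - true) / np.maximum(np.abs(true), eps)).mean(axis=0)
    return mae, mape

pred = np.array([[520.0, 21.0, 55.0, 18.0]])
true = np.array([[480.0, 25.0, 50.0, 20.0]])
mae, mape = nutrition_errors(pred, true)
print(mae)   # [40. 4. 5. 2.]
print(mape)  # approx [8.33, 16.0, 10.0, 10.0]
```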

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the significance of incorporating cooking-process information for nutrition estimation, and the recommendation for minor revision. We are pleased that the contributions of the annotated HD-EPIC benchmark and released code are viewed as strengths.

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

full rationale

The paper proposes an empirical staged framework (V-Nutri) that fuses Nutrition5K-pretrained backbones with a lightweight module on newly annotated HD-EPIC video data and VideoMamba event detection. All load-bearing elements (pretrained weights, annotations, fusion architecture, and reported accuracy gains) are externally sourced or experimentally measured rather than defined in terms of the target nutrition estimates. No equations reduce by construction, no fitted parameters are relabeled as predictions, and no uniqueness theorems or ansatzes are imported via self-citation. The central claim rests on controlled experiments whose inputs are independent of the output metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework depends on pre-trained external models and assumes effective event detection; the abstract alone gives too little detail for an exhaustive listing of free parameters.

free parameters (1)
  • lightweight fusion module parameters
    Trainable weights in the fusion module that combine final-dish and keyframe features (a minimal sketch follows this ledger).
axioms (2)
  • domain assumption: Nutrition5K-pretrained visual backbones capture features useful for nutrition estimation
    The method builds directly on these pre-trained models without retraining from scratch.
  • domain assumption: a VideoMamba-based model accurately detects ingredient-addition events for keyframe selection
    The cooking keyframe selection module relies on this detection quality.
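The single-entry ledger can be made concrete by freezing a pretrained backbone and counting only the fusion head's weights. The sketch below uses torchvision's ResNet-50 as a stand-in for the Nutrition5K-pretrained backbone, and the head sizing is hypothetical; the paper's backbones and module shapes may differ.

```python
import torch.nn as nn
from torchvision.models import resnet50

# Stand-in backbone; the paper uses Nutrition5K-pretrained backbones instead.
backbone = resnet50(weights=None)
for p in backbone.parameters():
    p.requires_grad = False  # ledger axiom: backbone features taken as given

# Hypothetical fusion head sized for concatenated dish + pooled keyframe features.
fusion_head = nn.Sequential(nn.Linear(2 * 2048, 512), nn.GELU(), nn.Linear(512, 4))

trainable = sum(p.numel() for p in fusion_head.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters())
print(f"frozen backbone params: {frozen:,}; trainable fusion params: {trainable:,}")
```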

pith-pipeline@v0.9.0 · 5532 in / 1340 out tokens · 63303 ms · 2026-05-10T14:55:46.690685+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 8 canonical work pages · 2 internal anchors
