V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3
The pith
Keyframes of the cooking process in egocentric videos supply complementary nutritional cues that images of the final dish alone obscure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
V-Nutri combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and from cooking-process keyframes selected by a VideoMamba-based event detector targeting ingredient-addition moments. On the newly annotated HD-EPIC benchmark, this yields more accurate calorie and macronutrient estimates than final-frame-only baselines.
What carries the argument
The lightweight fusion module that combines visual features from the final dish frame with those from cooking keyframes selected by a VideoMamba-based event detector targeting ingredient-addition moments.
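The review does not describe the fusion module's internals, so the sketch below is an assumption, not the authors' architecture: it only illustrates the data flow of a "lightweight" fusion, mean-pooling keyframe features, concatenating with the final-dish feature, and regressing the four nutrition targets with a linear head. The function name, dimensions, and pooling choice are all illustrative.

```python
import numpy as np

def fuse_and_predict(final_feat, keyframe_feats, w, b):
    """Hypothetical lightweight fusion head (not the paper's code):
    mean-pool the process-keyframe features, concatenate with the
    final-dish feature, and regress four targets (calories, fat,
    carbohydrate, protein) with a single linear layer."""
    pooled = keyframe_feats.mean(axis=0)           # aggregate process cues
    fused = np.concatenate([final_feat, pooled])   # shape (2*D,)
    return fused @ w + b                           # shape (4,)

rng = np.random.default_rng(0)
D = 8                                      # toy feature dimension
final_feat = rng.normal(size=D)            # backbone feature, final dish frame
keyframe_feats = rng.normal(size=(5, D))   # features of 5 selected keyframes
w, b = rng.normal(size=(2 * D, 4)), np.zeros(4)
pred = fuse_and_predict(final_feat, keyframe_feats, w, b)
print(pred.shape)  # (4,)
```

A learned attention or gating scheme would fit the same interface; the point is only that process keyframes enter as an extra pooled feature alongside the final-dish representation.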
If this is right
- Process cues from the cooking sequence can supplement final-dish visuals for more accurate calorie and macronutrient estimates.
- The size of the improvement scales with the capacity of the visual backbone and the precision of the ingredient-addition detector.
- A public benchmark now exists for evaluating video-based dish nutrition methods on the annotated HD-EPIC data.
Where Pith is reading between the lines
- The same staged fusion idea could be tested on longer, unscripted home videos where ingredient order is less predictable.
- If keyframe selection improves, the approach might reduce the need for perfectly lit final-dish photos in wearable dietary trackers.
- The dependence on backbone capacity suggests that stronger general video models could widen the benefit without changing the fusion design.
Load-bearing premise
That the keyframes chosen by the event detector contain nutritional information not visible in the final dish, and that this information can be fused without adding noise.
What would settle it
Run the same fusion architecture on the HD-EPIC test set with the selected process keyframes replaced by random or empty frames: if accuracy does not degrade, the claim that the keyframes carry complementary nutritional evidence is falsified.
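That falsification test can be phrased as a small ablation harness. Everything below (field names, the toy predictor, the dish records) is illustrative scaffolding rather than the paper's code; only the control logic matters: the same predictor is evaluated under three frame sources, and the comparison of errors decides the claim.

```python
import numpy as np

rng = np.random.default_rng(1)

def mae(preds, targets):
    return float(np.mean(np.abs(np.asarray(preds) - np.asarray(targets))))

def run_ablation(predict, dishes, frame_source):
    """Evaluate one frame-source condition: 'selected' uses the detector's
    keyframes, 'random' samples uniformly from the video, 'empty' feeds
    zero frames. All names here are assumptions for illustration."""
    preds, targets = [], []
    for dish in dishes:
        if frame_source == "selected":
            frames = dish["keyframes"]
        elif frame_source == "random":
            idx = rng.integers(0, len(dish["all_frames"]),
                               size=len(dish["keyframes"]))
            frames = dish["all_frames"][idx]
        else:  # "empty"
            frames = np.zeros_like(dish["keyframes"])
        preds.append(predict(dish["final_frame"], frames))
        targets.append(dish["kcal"])
    return mae(preds, targets)

# Toy stand-ins so the harness runs end to end.
dishes = [{"final_frame": rng.normal(size=4),
           "all_frames": rng.normal(size=(20, 4)),
           "keyframes": rng.normal(size=(3, 4)),
           "kcal": 500.0} for _ in range(5)]
toy_predict = lambda final, frames: 450.0 + float(frames.mean())
for source in ("selected", "random", "empty"):
    print(source, run_ablation(toy_predict, dishes, source))
```

If the "selected" condition shows no lower error than "random" or "empty" under the real model, the complementary-evidence claim fails.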
Original abstract
Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the final, completed dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframe selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event-detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces V-Nutri, a staged framework for dish-level nutrition estimation from egocentric cooking videos. It further annotates the HD-EPIC dataset to establish the first benchmark for this task and combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking-process keyframes. Keyframes are selected by a VideoMamba-based event detector targeting ingredient-addition moments. Experiments on HD-EPIC show that process cues supply complementary nutritional evidence and improve estimation under controlled conditions, with gains depending on backbone capacity and event-detection quality. Code and annotations are released.
Significance. If the reported gains hold, the work advances visual nutrition estimation by demonstrating that temporal cooking-process information can resolve ambiguities (e.g., oils, sauces) that persist in single final-dish images. The explicit release of the annotated HD-EPIC benchmark and reproducible code is a clear strength that supports further research in computational health applications.
minor comments (2)
- Abstract: the claim that process cues 'improve nutrition estimation under controlled conditions' is presented without any numerical results, baselines, or definition of the controlled conditions; adding one or two key quantitative findings would make the abstract a more informative summary of the central result.
- The manuscript would benefit from a brief explicit statement of the precise evaluation metrics (e.g., MAE or percentage error on calories/macronutrients) and the exact train/test splits used on HD-EPIC, even if they appear in the tables.
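For concreteness, the two metrics this comment asks for can be written down. The definitions below are the common conventions (MAE in absolute units, and MAE expressed as a percentage of the mean ground-truth value, as in Nutrition5k-style evaluations); they are stated here as assumptions, since the review does not restate the paper's exact choice.

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error, e.g. in kcal for calories or grams for macros."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean(np.abs(pred - target)))

def pct_mae(pred, target):
    """MAE as a percentage of the mean ground-truth value (a common
    convention for nutrition benchmarks; assumed, not confirmed here)."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(100.0 * np.mean(np.abs(pred - target)) / np.mean(target))

pred = [480.0, 250.0, 610.0]   # predicted kcal for three dishes
true = [500.0, 300.0, 600.0]   # ground-truth kcal
print(round(mae(pred, true), 2))      # 26.67
print(round(pct_mae(pred, true), 2))  # 5.71
```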
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the significance of incorporating cooking-process information for nutrition estimation, and the recommendation for minor revision. We are pleased that the contributions of the annotated HD-EPIC benchmark and released code are viewed as strengths.
Circularity Check
No significant circularity detected in derivation or claims
full rationale
The paper proposes an empirical staged framework (V-Nutri) that fuses Nutrition5K-pretrained backbones with a lightweight module on newly annotated HD-EPIC video data and VideoMamba event detection. All load-bearing elements (pretrained weights, annotations, fusion architecture, and reported accuracy gains) are externally sourced or experimentally measured rather than defined in terms of the target nutrition estimates. No equations reduce by construction, no fitted parameters are relabeled as predictions, and no uniqueness theorems or ansatzes are imported via self-citation. The central claim rests on controlled experiments whose inputs are independent of the output metrics.
Axiom & Free-Parameter Ledger
free parameters (1)
- lightweight fusion module parameters
axioms (2)
- domain assumption Nutrition5K-pretrained visual backbones capture features useful for nutrition estimation
- domain assumption VideoMamba-based model accurately detects ingredient-addition events for keyframe selection
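The second assumption can be made concrete with a toy selection rule. The detector's real output format is not specified in this review, so the per-frame score array, threshold, and minimum-gap heuristic below are all assumptions; the sketch only shows how event scores could turn into a sparse set of ingredient-addition keyframes.

```python
import numpy as np

def select_keyframes(event_scores, threshold=0.5, min_gap=30):
    """Hypothetical keyframe selection from per-frame ingredient-addition
    scores: keep frames whose score clears a threshold, enforcing a
    minimum frame gap so one addition event yields one keyframe."""
    keep = []
    last = -min_gap
    for i, s in enumerate(event_scores):
        if s >= threshold and i - last >= min_gap:
            keep.append(i)
            last = i
    return keep

scores = np.zeros(200)
scores[[40, 41, 120]] = [0.9, 0.8, 0.7]   # two addition events, one smeared
print(select_keyframes(scores))  # [40, 120]
```

Under this premise, detector precision directly bounds the quality of the fused features: a spurious peak injects an irrelevant frame, and a missed event drops nutritional evidence.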
Reference graph
Works this paper leans on
- [1] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206–24221, 2021.
- [2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
- [3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
- [4] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning, pages 813–824. PMLR, 2021.
- [5] Shyamal Buch, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. Flexible frame selection for efficient video reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29071–29082, 2025.
- [6] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [7] Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn, Jinge Ma, Fengqing Zhu, and Jiangpeng He. Implicit-scale 3D reconstruction for multi-food volume estimation from monocular images. arXiv preprint arXiv:2602.13041, 2026.
- [8] Bruce Coburn, Jiangpeng He, Megan E Rollo, Satvinder S Dhaliwal, Deborah A Kerr, and Fengqing Zhu. Evaluating large multimodal models for nutrition analysis: A new benchmark enriched with contextual metadata. In Proc. IEEE International Conference on Biomedical and Health Informatics (BHI), 2025.
- [9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [10] Takumi Ege, Yoshikazu Ando, Ryosuke Tanno, Wataru Shimoda, and Keiji Yanai. Image-based estimation of real food size for accurate food calorie estimation. In 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 274–279, 2019.
- [11] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
- [12] Yihang Feng, Yi Wang, Xinhao Wang, Bo Zhao, Jinbo Bi, Song Han, Zhenlei Xiao, and Yangchao Luo. RGB-D food nutrient estimation supported by FLAVA contrastive learning. Journal of Food Composition and Analysis, page 108821.
- [13] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [14] Jiangpeng He and Fengqing Zhu. Online continual learning for visual food classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2337–2346, 2021.
- [15] Jiangpeng He, Zeman Shao, Janine Wright, Deborah Kerr, Carol Boushey, and Fengqing Zhu. Multi-task image-based dietary assessment for food recognition and portion size estimation. In 2020 IEEE Conference on Multimedia Information Processing and Retrieval, pages 49–54. IEEE, 2020.
- [16] Jiangpeng He, Runyu Mao, Zeman Shao, Janine L Wright, Deborah A Kerr, Carol J Boushey, and Fengqing Zhu. An end-to-end food image analysis system. arXiv preprint arXiv:2102.00645, 2021.
- [17] Jiangpeng He, Xiaoyan Zhang, Luotao Lin, Jack Ma, Heather A Eicher-Miller, and Fengqing Zhu. Long-tailed continual learning for visual food recognition. IEEE Transactions on Multimedia, 28:865–877, 2025.
- [18] Jiangpeng He, Yuhao Chen, Gautham Vinod, Xiaoyan Zhang, Talha Ibn Mahmud, Ahmad AlMughrabi, Umair Haroon, Ricardo Marques, Petia Radeva, Yawei Jueluo, et al. Physically informed 3D food reconstruction: Methods and results. IEEE Journal of Biomedical and Health Informatics.
- [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [20] Zhiming Hu, Ning Ye, and Iqbal Mohomed. mmSampler: Efficient frame sampler for multimodal video retrieval. In Proceedings of Machine Learning and Systems, pages 153–171, 2022.
- [21] Pitikorn Khlaisamniang, Kun Kerdthaisong, Supasate Vorathammathorn, Nutchanon Yongsatianchot, Hirunkul Phimsiri, Amrest Chinkamol, Teermade Thitseesaeng, Kanyakorn Veerakanjana, Kaisorn Kachai, Piyalitt Ittichaiwong, et al. Decomposing food images for better nutrition analysis: A nutritionist-inspired two-step multimodal LLM approach. In Proceedings ..., 2025.
- [22] Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. Video token merging for long video understanding. In Advances in Neural Information Processing Systems, pages 13851–13871. Curran Associates, Inc., 2024.
- [23] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. Less is more: ClipBERT for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7331–7341, 2021.
- [24] Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. UniFormer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.
- [25] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. UniFormerV2: Spatiotemporal learning by arming image ViTs with video UniFormer. arXiv preprint arXiv:2211.09552, 2022.
- [26] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. VideoMamba: State space model for efficient video understanding. In European Conference on Computer Vision, pages 237–255. Springer, 2024.
- [27] Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, and Limin Wang. OCSampler: Compressing videos to one clip with single-step sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13894–13903, 2022.
- [28] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3889–3898, 2019.
- [29] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
- [30] Frank P.-W. Lo, Jianing Qiu, Modou L. Jobarteh, Yingnan Sun, Zeyu Wang, Shuo Jiang, Tom Baranowski, Alex K. Anderson, Megan A. McCrory, Edward Sazonov, Wenyan Jia, Mingui Sun, Matilda Steiner-Asiedu, Gary Frost, and Benny Lo. AI-enabled wearable cameras for assisting dietary assessment in African populations. npj Digital Medicine, 7(1):356, 2024.
- [31] Ya Lu, Thomai Stathopoulou, and Stavroula Mougiakakou. Partially supervised multi-task network for single-view dietary assessment. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 8156–8163, 2021.
- [32] Jack Ma, Jiangpeng He, and Fengqing Zhu. An improved encoder-decoder framework for food energy estimation. In Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management, pages 53–59, 2023.
- [33] Jinge Ma, Xiaoyan Zhang, Gautham Vinod, Siddeshwar Raghavan, Jiangpeng He, and Fengqing Zhu. MFP3D: Monocular food portion estimation leveraging 3D point clouds. In International Conference on Pattern Recognition, pages 49–62. Springer, 2024.
- [34] Runyu Mao, Jiangpeng He, Zeman Shao, Sri Kalyan Yarlagadda, and Fengqing Zhu. Visual aware hierarchy based food recognition. In International Conference on Pattern Recognition, pages 571–598. Springer, 2021.
- [35] Austin Myers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alexander Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, and Kevin Murphy. Im2Calories: Towards an automated mobile vision food diary. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1233–1241, 2015.
- [36] Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. HD-EPIC: A highly-detailed egocentric video dataset. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025.
- [37] Michele N. Ravelli and Dale A. Schoeller. Traditional self-reported dietary instruments are prone to inaccuracies and new approaches are needed. Frontiers in Nutrition, Volume 7, 2020.
- [38] Ivan Rodin, Antonino Furnari, Dimitrios Mavroeidis, and Giovanni Maria Farinella. Predicting the future from first person (egocentric) vision: A survey. Computer Vision and Image Understanding, 211:103252, 2021.
- [39] Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. TokenLearner: Adaptive space-time tokenization for videos. In Advances in Neural Information Processing Systems, pages 12786–12797. Curran Associates, Inc., 2021.
- [40] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3020–3028, 2017.
- [41] Amaia Salvador, Michal Drozdzal, Xavier Giró-i-Nieto, and Adriana Romero. Inverse cooking: Recipe generation from food images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10453–10462, 2019.
- [42] Wenjing Shao, Sujuan Hou, Weikuan Jia, and Yuanjie Zheng. Rapid non-destructive analysis of food nutrient content using Swin-Nutrition. Foods, 11(21):3429, 2022.
- [43] Zeman Shao, Shaobo Fang, Runyu Mao, Jiangpeng He, Janine L Wright, Deborah A Kerr, Carol J Boushey, and Fengqing Zhu. Towards learning food portion from monocular images with cross-domain feature adaptation. In 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), pages 1–6. IEEE, 2021.
- [44] Zeman Shao, Gautham Vinod, Jiangpeng He, and Fengqing Zhu. An end-to-end food portion estimation framework based on shape reconstruction from monocular image. In 2023 IEEE International Conference on Multimedia and Expo (ICME), pages 942–947, 2023.
- [45] Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29118–29128, 2025.
- [46] Quin Thames, Arjun Karpur, Wade Norris, Fangting Xia, Liviu Panait, Tobias Weyand, and Jack Sim. Nutrition5k: Towards automatic nutritional understanding of generic food. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8903–8911, 2021.
- [47] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Conference on Neural Information Processing Systems, 2022.
- [48] Gautham Vinod, Jiangpeng He, Zeman Shao, and Fengqing Zhu. Food portion estimation via 3D object scaling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3741–3749, 2024.
- [49] Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Jiangpeng He, and Fengqing Zhu. Size Matters: Reconstructing real-scale 3D models from monocular images for food portion estimation. arXiv preprint arXiv:2601.20051.
- [50] Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, and Lu Yuan. OmniVL: One foundation model for image-language and video-language tasks. Advances in Neural Information Processing Systems, 35:5696–5710, 2022.
- [51] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.
- [52] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, and Lu Yuan. BEVT: BERT pretraining of video transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14733–14743, 2022.
- [53] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- [54] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. InternVideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024.
- [55] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022.
- [56] Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, and Larry S Davis. AdaFrame: Adaptive frame selection for fast video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1278–1287, 2019.
- [57] Boyang Xia, Wenhao Wu, Haoran Wang, Rui Su, Dongliang He, Haosen Yang, Xiaoran Fan, and Wanli Ouyang. NSNet: Non-saliency suppression sampler for efficient video recognition. In European Conference on Computer Vision, 2022.
- [58] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6787–6800, 2021.
- [59] Runze Yan, Hanqi Luo, Jiaying Lu, Darren Liu, Hannah Posluszny, Mehak Preet Dhaliwal, Janice MacLeod, Yao Qin, Carl Yang, Terry J Hartman, et al. DietAI24 as a framework for comprehensive nutrition estimation using multimodal large language models. Communications Medicine, 5(1):458, 2025.
- [60] Donglin Zhang, Weixiang Shi, Boyuan Ma, Weiqing Min, and Xiao-Jun Wu. IGSMNet: Ingredient-guided semantic modeling network for food nutrition estimation. Foods, 14(21), 2025.
discussion (0)