FoodX-251: A Dataset for Fine-grained Food Classification

Ajay Divakaran; Karan Sikka; Parneet Kaur; Serge Belongie; Weijun Wang

arxiv: 1907.06167 · v1 · pith:2SVOBSKEnew · submitted 2019-07-14 · 💻 cs.CV

Parneet Kaur, Karan Sikka, Weijun Wang, Serge Belongie, Ajay Divakaran

Pith reviewed 2026-05-24 22:09 UTC · model grok-4.3

keywords: food classification · fine-grained visual categorization · dataset · web image collection · human verification · deep learning baselines · computer vision

The pith

FoodX-251 supplies 251 fine-grained food categories and 158k web images to train and test deep models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FoodX-251 to fill the gap in large-scale resources needed for fine-grained food classification, where categories look very similar. It assembles 251 categories totaling 158,000 images from web sources, splits them into 118,000 training images and 40,000 human-verified images for validation and testing, and describes the collection steps. Baselines using deep learning models are reported, and the set powered the iFood-2019 challenge.

Core claim

We introduce FoodX-251, a dataset of 251 fine-grained food categories with 158k images collected from the web. We use 118k images as a training set and provide human verified labels for 40k images that can be used for validation and testing. The procedure of creating this dataset is outlined and relevant baselines with deep learning models are provided.

What carries the argument

The FoodX-251 dataset, which organizes 251 categories into a web-sourced training split plus a human-verified validation and test split to serve as a benchmark resource.

If this is right

Models trained on the 118k images can be evaluated reliably on the 40k verified set for fair comparisons across methods.
The dataset enables challenges and shared benchmarks that standardize progress measurement in fine-grained food tasks.
The outlined collection and verification procedure can be repeated to expand the number of categories or images over time.
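
As a rough illustration of what a fair comparison on the verified split could look like, the sketch below computes top-k accuracy from per-class scores. The metric choice, the function name `top_k_accuracy`, and the toy scores are assumptions for illustration; the summary above does not say which metric the paper reports.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 3 images over 4 classes (not real model output).
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 is ranked first
    [0.5, 0.1, 0.3, 0.1],  # true class 2 is ranked second
    [0.4, 0.3, 0.2, 0.1],  # true class 3 is ranked last
]
toy_labels = [1, 2, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

Every method being compared would need to be scored on the same verified images with the same k for the numbers to be comparable.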

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same web-scraping plus verification pipeline could be applied to create comparable resources for other domains with high visual similarity, such as plant species or vehicle models.
Combining the image set with recipe text or ingredient lists might produce multimodal models that outperform image-only baselines on the same categories.
If models reach high accuracy on FoodX-251, they become candidates for deployment in mobile apps that log meals from photos.

Load-bearing premise

Web-sourced images, once filtered and human-verified according to the described steps, form a representative and correctly labeled collection that improves training and evaluation of food classification models.

What would settle it

Either re-inspect a random sample of the 40k verified images for label errors, or train several standard deep models and measure whether accuracy improves substantially over prior, smaller food datasets. High label-error rates or flat performance gains would undermine the claim.
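
The label re-inspection described above can be sketched as a small audit: draw a reproducible random sample of image IDs, have reviewers count the labels they judge wrong, and put a confidence interval around the observed error rate. This is a minimal sketch under assumptions: the helper names, the sample size, and the use of a Wilson score interval are illustrative choices, not a procedure taken from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_audit_sample(image_ids: Sequence[str], sample_size: int, seed: int = 0) -> List[str]:
    """Draw a reproducible simple random sample of image IDs to re-inspect."""
    rng = random.Random(seed)
    return rng.sample(list(image_ids), sample_size)


def wilson_interval(errors: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a label-error rate (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 500 images sampled from 40,000 placeholder IDs,
# of which reviewers flag 10 as mislabeled (made-up numbers).
ids = [f"img_{i:05d}" for i in range(40_000)]
sample = draw_audit_sample(ids, sample_size=500, seed=42)
low, high = wilson_interval(errors=10, n=len(sample))
```

A narrow interval near zero would support the verified split as a test set; a wide or high interval would suggest a larger audit or relabeling.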

Figures

Figures reproduced from arXiv: 1907.06167 by Ajay Divakaran, Karan Sikka, Parneet Kaur, Serge Belongie, Weijun Wang.

**Figure 2.** Noise in web data. Cross-domain noise: Along with images of a specific food class, web image search also returns images of processed and packaged food items and their ingredients. Cross-category noise: An image may contain multiple food items but carry only one label as its ground truth.

Dataset comparison table (truncated in source): columns are Dataset, Classes, Total Images, Source, Food-type. ETHZ Food-101 [7]: 101 classes, 101,000 images, foodspotting.com, Misc. UPMC Food-101 [26]: 101 classes, 90,8…

**Figure 3.** [Left] Training images distribution per class. [Right] …
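
A per-class distribution like the one in the left panel of Figure 3 can be summarized from a list of training labels. The snippet below is a minimal sketch with made-up labels; the function name and the returned fields are illustrative and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def class_distribution(labels: Sequence[str]) -> Dict[str, float]:
    """Summarize how many training images each class has."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    sizes = sorted(counts.values(), reverse=True)
    return {
        "num_classes": len(counts),
        "max": sizes[0],
        "min": sizes[-1],
        "mean": sum(sizes) / len(sizes),
        # Ratio of the largest to the smallest class size.
        "imbalance_ratio": sizes[0] / sizes[-1],
    }


# Hypothetical labels, not drawn from FoodX-251.
toy_train_labels = ["ramen"] * 6 + ["pho"] * 3 + ["laksa"] * 2
summary = class_distribution(toy_train_labels)
```

A large imbalance ratio is one signal that per-class metrics deserve a look alongside overall accuracy.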

Original abstract

Food classification is a challenging problem due to the large number of categories, high visual similarity between different foods, as well as the lack of datasets for training state-of-the-art deep models. Solving this problem will require advances in both computer vision models as well as datasets for evaluating these models. In this paper we focus on the second aspect and introduce FoodX-251, a dataset of 251 fine-grained food categories with 158k images collected from the web. We use 118k images as a training set and provide human verified labels for 40k images that can be used for validation and testing. In this work, we outline the procedure of creating this dataset and provide relevant baselines with deep learning models. The FoodX-251 dataset has been used for organizing iFood-2019 challenge in the Fine-Grained Visual Categorization workshop (FGVC6 at CVPR 2019) and is available for download.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces FoodX-251, a dataset of 251 fine-grained food categories containing 158k web-collected images. It designates 118k images as a training set and supplies human-verified labels on 40k images for validation and testing. The manuscript outlines the dataset creation procedure, reports baseline results obtained with deep learning models, and notes that the dataset was used to organize the iFood-2019 challenge at FGVC6 (CVPR 2019).

Significance. If the collection and verification procedures produce a representative sample with reliably accurate labels, FoodX-251 would constitute a useful addition to the set of resources available for fine-grained visual categorization, particularly for food images where existing datasets are limited. The fact that the dataset has already supported an organized challenge supplies independent evidence of its practical utility for benchmarking.

major comments (2)

[Abstract] The statement that the 40k images carry 'human verified labels' is presented without any accompanying description of the verification protocol, number of annotators per image, inter-annotator agreement statistics, or quantitative label-accuracy measurements. These details are load-bearing for the claim that the split can be used for reliable validation and testing of state-of-the-art models.
[Abstract] Baselines with deep learning models are asserted to be provided, yet the supplied text contains no numerical performance figures, error analysis, or comparisons against prior food datasets. Without these results it is impossible to gauge whether FoodX-251 poses a meaningfully harder or more representative benchmark.
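
For context on the first comment, one commonly reported inter-annotator agreement statistic is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. The sketch below is illustrative only: it assumes exactly two annotators labeling the same images, and the labels are made up. Nothing here describes the protocol the authors used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on four images.
annotator_1 = ["ramen", "ramen", "pho", "pho"]
annotator_2 = ["ramen", "pho", "pho", "pho"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5 for this toy example
```

With more than two annotators per image, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.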

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each comment below and will revise the manuscript to improve the abstract's informativeness while preserving its length constraints.

Point-by-point responses

Referee: [Abstract] The statement that the 40k images carry 'human verified labels' is presented without any accompanying description of the verification protocol, number of annotators per image, inter-annotator agreement statistics, or quantitative label-accuracy measurements. These details are load-bearing for the claim that the split can be used for reliable validation and testing of state-of-the-art models.

Authors: We agree that the abstract would be strengthened by briefly summarizing the verification protocol. The full manuscript (Section 3) outlines the human verification process for the 40k images. We will revise the abstract to include a concise reference to this protocol. Inter-annotator agreement statistics and quantitative label-accuracy measurements were not collected during dataset creation; the verification followed the multi-annotator protocol described in the main text. We can add a note on this if the referee considers it necessary.

Revision: yes

Referee: [Abstract] Baselines with deep learning models are asserted to be provided, yet the supplied text contains no numerical performance figures, error analysis, or comparisons against prior food datasets. Without these results it is impossible to gauge whether FoodX-251 poses a meaningfully harder or more representative benchmark.

Authors: The full manuscript reports baseline results with deep learning models, including numerical performance figures, in the experiments section, along with error analysis and comparisons to prior food datasets. We will revise the abstract to incorporate key numerical results and a brief statement on benchmark characteristics to better convey its utility.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; dataset release paper with no derivation chain

Full rationale

The paper introduces FoodX-251 as a new dataset collected from the web with human verification. No mathematical derivations, equations, fitted parameters, or predictions are present. The central contribution is the data collection procedure and release itself, which does not reduce to any self-referential construction or self-citation load-bearing step. External use in the iFood-2019 challenge provides independent support. This is a standard non-circular dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper. No free parameters, mathematical axioms, or invented entities are required for the central claim, which rests on the data collection and labeling process described at a high level in the abstract.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 4 internal anchors

[1] Fitbit app. https://www.fitbit.com/app. Accessed: 2017-11-14.

[2] My diet coach. https://play.google.com/store/apps/details?id=com.dietcoacher.sos. Accessed: 2017-11-14.

[3] Myfitnesspal. https://www.myfitnesspal.com. Accessed: 2017-11-14.

[4] Marios Anthimopoulos, Joachim Dehais, Peter Diem, and Stavroula Mougiakakou. Segmentation and recognition of multi-food meal images for carbohydrate counting. In BIBE, pages 1–4. IEEE, 2013.

[5] Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gregory D Abowd, and Irfan Essa. Leveraging context to support automated food recognition in restaurants. In WACV, pages 580–587. IEEE, 2015.

[6] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.

[7] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014.

[8] Mei-Yun Chen, Yung-Hsiang Yang, Chia-Ju Ho, Shih-Han Wang, Shane-Ming Liu, Eugene Chang, Che-Hua Yeh, and Ming Ouhyoung. Automatic Chinese food identification and quantity estimation. In SIGGRAPH Asia 2012 Technical Briefs, page 29. ACM, 2012.

[9] Xinlei Chen and Abhinav Gupta. Webly supervised learning of convolutional networks. In ICCV, pages 1431–1439,

[10] Xin Chen, Yu Zhu, Hua Zhou, Liang Diao, and Dongyan Wang. ChineseFoodNet: A large-scale image dataset for Chinese food recognition. arXiv preprint arXiv:1705.02743,

[11] Felicia Cordeiro, Elizabeth Bales, Erin Cherry, and James Fogarty. Rethinking the mobile food journal: Exploring opportunities for lightweight photo-based capture. In HFCS, pages 3207–3216. ACM, 2015.

[12] J Deng, A Berg, S Satheesh, H Su, A Khosla, and L Fei-Fei. ILSVRC-2012, 2012. URL http://www.image-net.org/challenges/LSVRC, 2012.

[13] Giovanni Maria Farinella, Dario Allegra, Marco Moltisanti, Filippo Stanco, and Sebastiano Battiato. Retrieval and classification of food images. Computers in Biology and Medicine, 77:23–39, 2016.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[15] Hajime Hoashi, Taichi Joutou, and Keiji Yanai. Image recognition of 85 food categories by feature fusion. In ISM, pages 296–301. IEEE, 2010.

[16] Taichi Joutou and Keiji Yanai. A food image recognition system with multiple kernel learning. In ICIP, pages 285–

[17] Parneet Kaur, Karan Sikka, and Ajay Divakaran. Combining weakly and webly supervised learning for classifying food images. arXiv preprint arXiv:1712.08730, 2017.

[18] Yoshiyuki Kawano and Keiji Yanai. Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In ECCV, pages 3–17, 2014.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.

[21] Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, and Kevin P Murphy. Im2Calories: towards an automated mobile vision food diary. In ICCV, pages 1233–1241, 2015.

[22] Simon Mezgec and Barbara Koroušić Seljak. NutriNet: a deep learning food and drink image recognition system for dietary assessment. Nutrients, 9(7):657, 2017.

[23] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[24] Manika Puri, Zhiwei Zhu, Qian Yu, Ajay Divakaran, and Harpreet Sawhney. Recognition and volume estimation of food intake using a mobile device. In WACV, pages 1–8. IEEE, 2009.

[25] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.

[26] Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frederic Precioso. Recipe recognition with large multimodal food dataset. In ICMEW, pages 1–6. IEEE, 2015.

[27] Xin-Jing Wang, Lei Zhang, Xirong Li, and Wei-Ying Ma. Annotating images by mining image search results. TPAMI, 30(11):1919–1932, 2008.

[28] Weiyu Zhang, Qian Yu, Behjat Siddiquie, Ajay Divakaran, and Harpreet Sawhney. snap-n-eat food recognition and nutrition estimation on a smartphone. JDST, 9(3):525–533,

[29] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.

Footnote 6: https://github.com/karansikka1/iFood_2019
Footnote 7: https://github.com/karansikka1/Foodx
