arxiv: 1907.11384 · v1 · pith:GYERMLPBnew · submitted 2019-07-26 · 💻 cs.CV

Product Image Recognition with Guidance Learning and Noisy Supervision

Pith reviewed 2026-05-24 16:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords product image recognition · noisy labels · guidance learning · web data · convolutional neural networks · noisy supervision · Product-90

The pith

Guidance learning improves CNNs on noisy web product images by combining teacher soft labels with given noisy labels plus a small clean set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Product-90, a dataset of over 140,000 casually captured consumer photos across 90 categories, downloaded from e-commerce review pages with automatically assigned but noisy labels. It proposes guidance learning: first train a teacher network on the entire noisy set, then train the target student network in a multi-task way, where each noisy example is supervised by the combination of its original noisy label and the teacher's softened output, while a small manually verified clean set is used alongside. Experiments are reported to yield higher accuracy than prior noisy-label methods on Product-90 and on the public Food-101, Food-101N, and Clothing1M datasets. A sympathetic reader cares because web-scale image collections are cheap to gather yet noisy, and the method offers a straightforward way to extract useful signal from them without requiring massive clean labels.

Core claim

The paper claims that a student network trained with guidance knowledge—the combination of each example's given noisy label and the softened label produced by a teacher network pretrained on the full noisy dataset—together with a small clean set, achieves superior recognition accuracy on product images compared with state-of-the-art noisy-supervision techniques.
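One way to write the claim down is sketched here. The symbols β, α, and T appear in the paper's figure captions (for Eq. (3) and Eq. (5)), but the page does not reproduce those equations, so the exact forms below are assumptions: a convex combination for the guidance target and an α-weighted sum of the noisy-set and clean-set losses. The names $z^{t}_i$ (teacher logits), $p^{s}_i$ (student prediction), $y_i$ (one-hot given label), $D_n$ (noisy set), and $D_c$ (clean set) are introduced for illustration only.

```latex
\begin{aligned}
\tilde{p}_i &= \operatorname{softmax}\!\left(z^{t}_i / T\right)
  && \text{(teacher output softened with temperature } T\text{)} \\
g_i &= (1-\beta)\, y_i + \beta\, \tilde{p}_i
  && \text{(assumed form of the guidance target)} \\
\mathcal{L} &= \alpha\, \frac{1}{|D_n|} \sum_{i \in D_n} \mathrm{CE}\!\left(g_i,\, p^{s}_i\right)
  + (1-\alpha)\, \frac{1}{|D_c|} \sum_{j \in D_c} \mathrm{CE}\!\left(y_j,\, p^{s}_j\right)
  && \text{(assumed multi-task loss)}
\end{aligned}
```

Under this reading, setting β = 0 recovers plain training on the given noisy labels, which is the comparison the core claim implicitly depends on.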

What carries the argument

Guidance learning, the two-stage teacher-student procedure that supplies combined noisy-plus-soft labels to the student network.
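The combined label can be illustrated with a minimal pure-Python sketch. It assumes the guidance target is a convex combination of the one-hot noisy label and a temperature-softened teacher distribution; the page does not give the paper's exact equation, so the function names, the weighting direction of `beta`, and the default `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a flatter output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def guidance_target(noisy_label, teacher_logits, beta=0.3, temperature=5.0):
    """Mix the one-hot noisy label with the teacher's softened prediction.

    ASSUMPTION: a convex combination with weight `beta` on the teacher term.
    The paper's Eq. (3) may use a different form.
    """
    num_classes = len(teacher_logits)
    soft = softmax(teacher_logits, temperature)
    one_hot = [1.0 if k == noisy_label else 0.0 for k in range(num_classes)]
    return [(1.0 - beta) * h + beta * s for h, s in zip(one_hot, soft)]


# Toy example: the teacher favours class 0, the given (noisy) label says class 1.
teacher_logits = [2.0, 0.5, -1.0]
target = guidance_target(noisy_label=1, teacher_logits=teacher_logits)
```

The resulting `target` is still a probability distribution: most of its mass stays on the given label, while the teacher's belief shifts some mass to the classes it finds plausible.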

If this is right

  • The method handles real-world challenges such as background clutter and category diversity in consumer photos.
  • Large noisy web datasets become usable when paired with only a modest clean set.
  • Performance gains appear across product, food, and clothing recognition tasks.
  • The approach is simple enough to apply directly to existing CNN training pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same teacher-student label combination could be tested on other web-collected image tasks such as general object or scene recognition.
  • Replacing the single teacher pass with an iterative update of soft labels might yield further gains.
  • The soft labels appear to encode useful visual patterns that hard noisy labels miss, suggesting the technique could complement other semi-supervised methods.

Load-bearing premise

The teacher network trained on the full noisy dataset produces soft labels accurate enough that combining them with the given noisy labels improves the student beyond what the clean set or standard noisy-label methods alone can achieve.

What would settle it

Train the student network on the same data splits but without the teacher's soft labels. If test accuracy on Product-90 or Clothing1M does not fall below the levels reported for guidance learning, the soft labels are not contributing what the paper attributes to them.
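The ablation described here amounts to switching off one term in the student's loss. The sketch below shows that switch on a toy batch in pure Python. It assumes an α-weighted sum of a noisy-set loss and a clean-set loss, with the teacher term mixed in by β; the function `student_loss`, the `use_teacher` flag, and all default values are hypothetical and not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def student_loss(noisy_batch, clean_batch, alpha=0.5, beta=0.3,
                 temperature=5.0, use_teacher=True):
    """ASSUMED multi-task loss: alpha * noisy-set loss + (1 - alpha) * clean-set loss.

    noisy_batch: list of (student_logits, teacher_logits, noisy_label)
    clean_batch: list of (student_logits, clean_label)
    use_teacher=False is the ablation: the soft-label term is dropped.
    """
    b = beta if use_teacher else 0.0
    noisy_loss = 0.0
    for s_logits, t_logits, label in noisy_batch:
        hard = one_hot(label, len(s_logits))
        soft = softmax(t_logits, temperature)
        target = [(1.0 - b) * h + b * s for h, s in zip(hard, soft)]
        noisy_loss += cross_entropy(target, softmax(s_logits))
    noisy_loss /= max(len(noisy_batch), 1)

    clean_loss = 0.0
    for s_logits, label in clean_batch:
        clean_loss += cross_entropy(one_hot(label, len(s_logits)), softmax(s_logits))
    clean_loss /= max(len(clean_batch), 1)

    return alpha * noisy_loss + (1.0 - alpha) * clean_loss


# Toy case: student and teacher both favour class 0, but the noisy label is 1.
noisy_batch = [([3.0, 0.0, 0.0], [4.0, 0.0, 0.0], 1)]
clean_batch = [([0.0, 2.0, 0.0], 1)]
with_teacher = student_loss(noisy_batch, clean_batch)
without_teacher = student_loss(noisy_batch, clean_batch, use_teacher=False)
```

In this toy case the teacher term softens the penalty for disagreeing with a wrong label, so `with_teacher` is smaller than `without_teacher`. Whether that translates into better test accuracy is exactly what the ablation on real data would have to show.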

Figures

Figures reproduced from arXiv: 1907.11384 by Hao Xing, Liangliang Cao, Qing Li, Wenbin Du, Xiaojiang Peng, Yu Qiao.

Figure 1: Example images from our Products-90. We illustrate 5 …
Figure 2: Illustration of the Products-90 dataset. Each image represents one class which is selected from clean data. Different image …
Figure 3: Statistics of the collected Products-90. Each color indicates a meta category. (Zoom in for better view.)
Figure 4: The proposed guidance learning framework. At the first stage, we utilize all training data to train a teacher model. At the second …
Figure 5: Evaluation of clean image ratios w.r.t. the original clean …
Figure 6: Evaluation of β in Eq.(3). Accompanying text: We observe that increasing β boosts performance but saturates above 0.3. α balances the importance between the losses of noisy set and clean set, T is the temperature used for softening. We evaluate α and T by fixing β to 0.3.
Figure 7: Evaluation of α and T in Eq.(5). Accompanying text: … outperforms the others regardless of α. Second, increasing β boosts performance in the beginning but degrades after 5, which indicates that a highly-soften operation corrupts supervision knowledge. Third, α and T impact performance jointly which change the loss of noisy set in Eq. (5).
Original abstract

This paper considers recognizing products from daily photos, which is an important problem in real-world applications but also challenging due to background clutters, category diversities, noisy labels, etc. We address this problem by two contributions. First, we introduce a novel large-scale product image dataset, termed as Product-90. Instead of collecting product images by labor- and time-intensive image capturing, we take advantage of the web and download images from the reviews of several e-commerce websites where the images are casually captured by consumers. Labels are assigned automatically by the categories of e-commerce websites. Totally the Product-90 consists of more than 140K images with 90 categories. Due to the fact that consumers may upload unrelated images, it is inevitable that our Product-90 introduces noisy labels. As the second contribution, we develop a simple yet efficient guidance learning (GL) method for training convolutional neural networks (CNNs) with noisy supervision. The GL method first trains an initial teacher network with the full noisy dataset, and then trains a target/student network with both large-scale noisy set and small manually-verified clean set in a multi-task manner. Specifically, in the stage of student network training, the large-scale noisy data is supervised by its guidance knowledge which is the combination of its given noisy label and the soften label from the teacher network. We conduct extensive experiments on our Products-90 and public datasets, namely Food101, Food-101N, and Clothing1M. Our guidance learning method achieves performance superior to state-of-the-art methods on these datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the Product-90 dataset (>140K images, 90 categories) collected from e-commerce review photos with automatically assigned but noisy labels. It proposes a guidance learning (GL) procedure: train a teacher CNN on the full noisy set, then train a student CNN in multi-task fashion on the noisy set (supervised by a combination of the original noisy label and the teacher's softened prediction) plus a small manually verified clean set. Experiments on Product-90 plus Food-101, Food-101N and Clothing1M are said to show GL outperforming prior state-of-the-art noisy-label methods.

Significance. If the reported gains are robust, the Product-90 dataset supplies a realistic large-scale benchmark for noisy consumer-product imagery, and the GL procedure offers a lightweight way to exploit abundant noisy web data together with limited clean supervision. Both contributions would be of practical value for e-commerce vision tasks.

major comments (2)
  1. [Abstract] The central empirical claim (superior performance on four datasets) is asserted without any numerical results, error bars, ablation tables, or description of how the small clean set was chosen or sized; this directly prevents verification of the claim from the given text.
  2. [Guidance Learning] The method rests on the assumption that soft labels produced by a teacher trained on the identical noisy data meaningfully augment the original noisy labels beyond what the clean set alone or standard noisy-label techniques achieve; no ablation that removes the soft-label term or compares against clean-set-only training is described, leaving the weakest assumption untested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below and indicate the changes we will make to the manuscript.

  1. Referee: [Abstract] The central empirical claim (superior performance on four datasets) is asserted without any numerical results, error bars, ablation tables, or description of how the small clean set was chosen or sized; this directly prevents verification of the claim from the given text.

    Authors: We agree that the abstract would be strengthened by the inclusion of key quantitative results and a brief description of the clean-set size and selection. In the revised version we will add specific accuracy figures (with standard deviations where available) for Product-90 and the three public datasets, together with the size of the manually verified clean subset used during student training.

    Revision: yes

  2. Referee: [Guidance Learning] The method rests on the assumption that soft labels produced by a teacher trained on the identical noisy data meaningfully augment the original noisy labels beyond what the clean set alone or standard noisy-label techniques achieve; no ablation that removes the soft-label term or compares against clean-set-only training is described, leaving the weakest assumption untested.

    Authors: The referee is correct that the current manuscript does not contain an explicit ablation that isolates the soft-label guidance term from clean-set-only training. While the reported comparisons against prior noisy-label methods already demonstrate gains, we acknowledge that a direct ablation removing the teacher soft-label component would more rigorously test the added value of the guidance signal. We will therefore add these ablation experiments to the revised manuscript.

    Revision: yes

Circularity Check

0 steps flagged

Empirical training procedure with no circular derivation

The paper presents an empirical method: train a teacher CNN on the full noisy Product-90 (and similar) dataset, then train a student in multi-task fashion using a combination of the original noisy label and the teacher's softened prediction. No equations, uniqueness theorems, or derivations are claimed; performance is evaluated via standard experiments on Product-90, Food-101, Food-101N, and Clothing1M. The central claim (superiority to SOTA) is a reported experimental outcome rather than a result forced by definition, fitted parameters renamed as predictions, or a self-citation chain. The method is self-contained against external benchmarks and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unverified assumption that a teacher trained on noisy data yields useful guidance signals and that the small clean set is representative enough to anchor the student; no free parameters, axioms, or invented entities are explicitly introduced in the provided text.


