arxiv: 2605.20892 · v1 · pith:WTHIJB6Xnew · submitted 2026-05-20 · 💻 cs.CV

FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained fruit classification · heterogeneous ensemble · MLLM arbitration · chain-of-thought reasoning · agricultural computer vision · hard sample loss · fruit dataset

The pith

A two-stage ensemble with MLLM arbitration reaches 70.49 percent accuracy on 306 fruit categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first builds a dataset of 306 fruit categories containing 116,233 images to overcome shortages of high-quality data and the problem of high visual similarity between classes. It then proposes FruitEnsemble, which runs a weighted ensemble of different neural network backbones to produce a reliable set of top-three candidates. When the ensemble confidence drops below 0.6, the system activates a multimodal large language model that checks the image against external botanical descriptions using step-by-step reasoning. The training also uses a loss that pays extra attention to hard samples. A sympathetic reader would care because accurate fine-grained recognition supports practical tasks such as automated sorting and quality inspection in agriculture.

Core claim

FruitEnsemble is a practical two-stage dynamic inference framework. In the first stage, it employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, an expert arbitration mechanism triggers a multimodal large language model (MLLM) when ensemble confidence falls below 0.6; the MLLM performs visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Training is optimized with a hard sample-aware joint loss. The paper reports 70.49 percent classification accuracy on the 306-category fruit dataset and states that this outperforms existing state-of-the-art models.
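The two-stage routing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the ensemble weights are assumed to be given (for example, derived from validation accuracy), `arbitrate` is a stand-in callable for the MLLM step, and "ensemble confidence" is assumed here to mean the top fused probability, which the paper may define differently.

```python
from typing import Callable, List, Sequence, Tuple


def weighted_ensemble(probs_per_model: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Fuse per-model class probabilities using normalized weights."""
    total = sum(weights)
    n_classes = len(probs_per_model[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs_per_model)) / total
        for c in range(n_classes)
    ]


def top_k(probs: Sequence[float], k: int = 3) -> List[int]:
    """Return the indices of the k highest-probability classes, best first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]


def predict(probs_per_model: Sequence[Sequence[float]],
            weights: Sequence[float],
            arbitrate: Callable[[List[int]], int],
            threshold: float = 0.6) -> Tuple[int, bool]:
    """Return (predicted class, whether the arbiter was invoked)."""
    fused = weighted_ensemble(probs_per_model, weights)
    candidates = top_k(fused, 3)
    # Assumption: confidence is the top fused probability.
    confidence = fused[candidates[0]]
    if confidence >= threshold:
        return candidates[0], False
    # Low confidence: hand the Top-3 pool to the (hypothetical) arbiter.
    return arbitrate(candidates), True
```

In a real system `arbitrate` would wrap the MLLM call with the image and the botanical descriptions of the three candidates; here any function that picks one of the candidate indices will do.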

What carries the argument

The expert arbitration mechanism that triggers an MLLM for visual verification using botanical descriptions and CoT reasoning only when ensemble confidence falls below 0.6.

If this is right

  • The framework supplies an efficient deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
  • It overcomes generalization limits of static single-model architectures on fine-grained problems.
  • The hard sample-aware joint loss improves performance on challenging examples during training.
  • Overall accuracy reaches 70.49 percent and exceeds previous methods on the constructed 306-category dataset.
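The formulation of the hard sample-aware joint loss is not given in the text summarized here. As a rough illustration of what "paying extra attention to hard samples" can mean, the sketch below combines plain cross-entropy with a focal-style term that down-weights easy examples. The focal weighting, the mixing coefficient `lam`, and the exponent `gamma` are all assumptions chosen for illustration and should not be read as the paper's loss.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one sample, given the probability of the true class."""
    return -math.log(p_true)


def focal_term(p_true: float, gamma: float = 2.0) -> float:
    """Focal-style term: cross-entropy scaled by (1 - p_true) ** gamma."""
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)


def joint_loss(p_true: float, lam: float = 0.5, gamma: float = 2.0) -> float:
    """Assumed joint loss: cross-entropy plus a weighted hard-sample term."""
    return cross_entropy(p_true) + lam * focal_term(p_true, gamma)
```

With this choice, a sample predicted at 0.3 for its true class contributes proportionally more through the focal term than one predicted at 0.9, which is the qualitative behavior the summary describes.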

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective use of the MLLM only on uncertain cases could serve as a pattern for balancing accuracy gains against extra computation in other vision tasks.
  • Adding external textual knowledge may help classification systems in domains where labeled images are scarce but descriptive information exists.
  • The two-stage design offers a template for hybrid systems that combine traditional ensembles with language-model verification.

Load-bearing premise

The MLLM arbitration step correctly resolves the difficult samples that the ensemble cannot handle by using external botanical descriptions.

What would settle it

A test showing that accuracy on low-confidence samples stays the same or drops after MLLM arbitration is applied, or that overall accuracy fails to exceed prior state-of-the-art results.
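A check of this kind is straightforward to compute once per-sample predictions are logged. The sketch below assumes a hypothetical record layout with the keys `confidence`, `label`, `ensemble_pred`, and `final_pred`; none of these names come from the paper.

```python
from typing import Dict, List, Optional


def arbitration_lift(records: List[Dict], threshold: float = 0.6) -> Optional[Dict]:
    """Compare accuracy before and after arbitration on low-confidence samples.

    Returns None when no sample falls below the threshold.
    """
    low = [r for r in records if r["confidence"] < threshold]
    if not low:
        return None
    n = len(low)
    before = sum(r["ensemble_pred"] == r["label"] for r in low) / n
    after = sum(r["final_pred"] == r["label"] for r in low) / n
    return {
        "n_triggered": n,
        "acc_before": before,
        "acc_after": after,
        "lift": after - before,
    }
```

A lift at or below zero on the triggered subset would be the outcome that undercuts the load-bearing premise.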

Figures

Figures reproduced from arXiv: 2605.20892 by Enhui Yu, Jialu Li, Junhui Li, Ruitong Lu, Youshan Zhang.

Figure 1: Accuracy vs. inference efficiency on the Fruit-306 test set. Single CNNs (△) are efficient but limited in accuracy. The LLM-only baseline (⋄) is slow and performs poorly without domain-specific training. Static ensembles (•) improve accuracy but remain sub-optimal. FruitEnsemble (⋆) achieves the best trade-off by invoking the LLM only for uncertain samples.

Figure 2: Comprehensive Statistical Analysis of Fruit-306. (a) Histogram of class sizes showing the frequency distribution of samples per category. (b) Cumulative distribution curve indicating that the top 20% of classes account for a significant majority of the total images. (c) Top 20 most frequent classes (Head), dominated by common commercial varieties. (d) Bottom 20 least frequent classes (Tail), highlighting t…

Figure 3: Visual Challenges in Fruit-306. Top Row: High inter-class similarity. Examples of visually similar pairs (e.g., Fuji vs. Gala) where discrimination relies on subtle textual attributes (e.g., lenticel density). Bottom Row: High intra-class variance. The same variety shown under different lighting, occlusion, and ripeness conditions. These complexities necessitate dynamic reasoning beyond static classificat…

Figure 4: Overview of the FruitEnsemble Framework. Input fruit images are first processed by a curated set of four heterogeneous backbone models (M1-M4: ResNet50, DenseNet201, EfficientNetB7, ViT-B/16) with architecture-adaptive training strategies. Their output probabilities are fused via an Uncertainty-Aware Weighted Aggregation module (A) to generate top-k candidate predictions and a confidence gap metric (∆ = p1…

Figure 5: Progressive performance improvement of FruitEnsemble. Starting from individual backbones, heterogeneous ensemble aggregation, test-time augmentation, and LLM arbitration progressively improve accuracy from 65.5% to 70.5%.

Figure 6: Effect of LLM trigger thresholds on overall accuracy. The relatively flat curve suggests that the base ensemble model has high confidence in most samples, and LLM intervention primarily benefits a small subset of difficult cases. This analysis helps determine the optimal threshold for deploying LLM arbitration in practice, trading off between accuracy gains and computational overhead.
Original abstract

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript constructs a 306-category fruit dataset (116,233 samples) and proposes FruitEnsemble, a two-stage dynamic inference framework. Stage 1 uses a validation-calibrated weighted ensemble of heterogeneous backbones to produce a top-3 candidate pool. When ensemble confidence falls below 0.6, an MLLM performs arbitration via external botanical descriptions and Chain-of-Thought reasoning. Training incorporates a hard-sample-aware joint loss. The paper reports 70.49% classification accuracy and claims outperformance over existing SOTA models on this dataset.

Significance. If the reported gains are shown to stem from the MLLM arbitration rather than the new dataset or ensemble weighting alone, the work could provide a practical, deployment-oriented solution for fine-grained agricultural vision tasks where visual similarity is high. The large-scale fruit dataset addresses a documented scarcity in the domain. Credit is due for the reproducible two-stage design and focus on real-world sorting applications, though the absence of isolating experiments limits the assessed novelty of the arbitration step.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of 70.49% accuracy and SOTA outperformance is presented without any ablation that isolates the MLLM arbitration component. No comparison of ensemble-only accuracy versus the full FruitEnsemble, no count of samples triggering the <0.6 threshold, and no error analysis showing which classes or ambiguities the botanical CoT resolves are supplied. This directly affects attribution of the performance gain to the novel arbitration mechanism.
  2. [§3.2] §3.2 (Arbitration Mechanism): The assumption that the MLLM step correctly resolves difficult samples rests on external botanical descriptions, yet no quantitative validation (e.g., accuracy lift on the low-confidence subset or failure cases of the MLLM) is reported. Without this, the two-stage design's load-bearing contribution cannot be verified.
minor comments (2)
  1. [§3.1] The hard-sample-aware joint loss is mentioned but its exact formulation (weighting scheme, interaction with the ensemble loss) is not given in sufficient detail for reproduction.
  2. [§2] Dataset construction details (collection protocol, annotation process, train/val/test splits) should be expanded to allow independent verification of the 306-class benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the attribution of results to the MLLM arbitration component.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of 70.49% accuracy and SOTA outperformance is presented without any ablation that isolates the MLLM arbitration component. No comparison of ensemble-only accuracy versus the full FruitEnsemble, no count of samples triggering the <0.6 threshold, and no error analysis showing which classes or ambiguities the botanical CoT resolves are supplied. This directly affects attribution of the performance gain to the novel arbitration mechanism.

    Authors: We acknowledge that the current manuscript lacks explicit ablations isolating the MLLM arbitration. In the revised version we will add a dedicated subsection in §4 that reports (i) top-1 accuracy of the validation-calibrated weighted ensemble alone, (ii) the number and percentage of test samples whose ensemble confidence fell below 0.6 and therefore triggered MLLM arbitration, and (iii) a per-class error analysis highlighting the visual ambiguities resolved by the botanical CoT step. These additions will allow readers to directly attribute performance gains to the arbitration mechanism while preserving the overall 70.49% result and SOTA comparison. revision: yes

  2. Referee: [§3.2] §3.2 (Arbitration Mechanism): The assumption that the MLLM step correctly resolves difficult samples rests on external botanical descriptions, yet no quantitative validation (e.g., accuracy lift on the low-confidence subset or failure cases of the MLLM) is reported. Without this, the two-stage design's load-bearing contribution cannot be verified.

    Authors: We agree that quantitative evidence for the MLLM step is necessary. We will augment §3.2 and the experimental results with (a) accuracy on the low-confidence subset before versus after MLLM arbitration and (b) representative failure cases where the MLLM did not resolve the ambiguity. These metrics will be computed on the same held-out test split used for the main results, thereby verifying the contribution of the two-stage design without altering the reported overall accuracy. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework reports experimental accuracy without derivations or self-referential reductions

full rationale

The paper describes a two-stage empirical method (weighted heterogeneous ensemble plus conditional MLLM arbitration on a new 306-class dataset) and reports a measured accuracy of 70.49%. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central performance claim rests on direct experimental evaluation rather than any chain that reduces by construction to its own inputs. This is the expected non-finding for an applied ML systems paper whose results are externally falsifiable via replication on the stated dataset.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The framework rests on standard supervised learning assumptions plus one explicit threshold; no new physical entities or unproven mathematical axioms are introduced.

free parameters (1)
  • ensemble confidence threshold
    The value 0.6 is used to decide when to invoke the MLLM arbitrator; its selection is not derived from first principles in the abstract.
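One way to examine a free threshold like this is to sweep it on held-out data and record both accuracy and the fraction of samples sent to the arbiter. The sketch below assumes the arbiter's prediction has been precomputed for every sample; the record keys (`confidence`, `label`, `ensemble_pred`, `arbiter_pred`) are hypothetical and the procedure is an illustration, not the authors' selection method.

```python
from typing import Dict, List, Sequence, Tuple


def sweep_thresholds(records: List[Dict],
                     thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, accuracy, trigger_rate) for each candidate threshold."""
    rows = []
    n = len(records)
    for t in thresholds:
        correct = 0
        triggered = 0
        for r in records:
            # Route to the arbiter only when ensemble confidence is below t.
            use_arbiter = r["confidence"] < t
            triggered += int(use_arbiter)
            pred = r["arbiter_pred"] if use_arbiter else r["ensemble_pred"]
            correct += int(pred == r["label"])
        rows.append((t, correct / n, triggered / n))
    return rows
```

Reading accuracy against trigger rate makes the cost side of the trade-off explicit, since each triggered sample implies one additional MLLM call.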
