arxiv: 1306.5151 · v1 · submitted 2013-06-21 · 💻 cs.CV

Fine-Grained Visual Classification of Aircraft

Subhransu Maji , Esa Rahtu , Juho Kannala , Matthew Blaschko , Andrea Vedaldi

Pith reviewed 2026-05-11 17:35 UTC · model grok-4.3

keywords: fine-grained visual classification, aircraft dataset, FGVC-Aircraft, image classification, computer vision, benchmark dataset, rigid objects, object recognition

The pith

The paper introduces FGVC-Aircraft, a dataset of 10,000 images across 100 aircraft models organized in a three-level hierarchy for fine-grained visual classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes FGVC-Aircraft as a benchmark dataset for fine-grained visual classification by providing 10,000 images of 100 aircraft models organized hierarchically. The authors state that differences between models are subtle but visually measurable, which makes the tasks challenging yet solvable and distinct from those involving deformable objects such as animals. They supply evaluation protocols and baseline results, and note that contributions from aircraft enthusiasts made the dataset possible, a strategy that could extend to other object classes. Aircraft vary in purpose, size, designation, structure, historical style, and branding, which offers modes of variation that differ from those in animal domains.

Core claim

The central contribution is the FGVC-Aircraft dataset itself, which contains 10,000 images of aircraft from 100 models arranged in a three-level hierarchy. At the finest level, differences between models are often subtle but always visually measurable, making visual recognition challenging but possible. Corresponding classification tasks and evaluation protocols are defined, with baseline results presented. The dataset's creation leverages work by aircraft enthusiasts, a method extendable to other object classes. Compared to typical fine-grained domains like animals, aircraft are rigid and less deformable but exhibit interesting variations in purpose, size, designation, structure, historical style, and branding.

What carries the argument

The FGVC-Aircraft dataset, a hierarchically organized collection of 10,000 aircraft images across 100 models that enables definition of fine-grained classification tasks.

If this is right

Defines specific classification tasks and evaluation protocols based on the hierarchy.
Provides baseline performance results for standard classification methods on the dataset.
Shows that enthusiast-sourced data can construct useful fine-grained datasets for other object classes.
Identifies unique variation modes in aircraft such as historical style and branding that differ from animal domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithms developed on this dataset might transfer to practical applications like automatic aircraft type identification at airports.
The three-level hierarchy could support hierarchical classification approaches that improve accuracy by leveraging coarser categories first.
Future work might compare results here to other FGVC datasets to understand the impact of object rigidity on recognition difficulty.
Extending the enthusiast-contribution method could rapidly create benchmarks for other vehicle or manufactured object classes.
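
The coarse-first idea mentioned in the list above can be sketched as follows. This is a minimal illustration, not a method from the paper: the two-level label map `PARENT`, the label names, and the function `coarse_to_fine` are all hypothetical, and the scores are assumed to be non-negative per-class outputs such as softmax probabilities.

```python
from collections import defaultdict

# Hypothetical slice of a hierarchy: fine label -> coarse label.
# These names are illustrative and are not taken from the dataset.
PARENT = {
    "model_a1": "family_a",
    "model_a2": "family_a",
    "model_b1": "family_b",
}


def coarse_to_fine(fine_scores, parent):
    """Pick the best coarse class first, then the best fine class inside it.

    fine_scores maps each fine label to a non-negative score.
    A coarse score is the sum of the scores of its children.
    """
    coarse_scores = defaultdict(float)
    for label, score in fine_scores.items():
        coarse_scores[parent[label]] += score
    best_coarse = max(coarse_scores, key=coarse_scores.get)
    children = [label for label in fine_scores if parent[label] == best_coarse]
    best_fine = max(children, key=fine_scores.get)
    return best_coarse, best_fine


# A flat argmax would choose model_b1 (0.40), but family_a holds more total mass.
scores = {"model_a1": 0.35, "model_a2": 0.25, "model_b1": 0.40}
result = coarse_to_fine(scores, PARENT)
print(result)
```

Whether committing to the coarse class first actually helps accuracy on this dataset is an open question; the sketch only shows where the two decision rules can disagree.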

Load-bearing premise

That the visual differences between the 100 aircraft models are always measurable from the images and that the three-level hierarchy provides a useful structure for the classification tasks.

What would settle it

A demonstration that certain pairs of aircraft models cannot be reliably distinguished by visual inspection of the dataset images, or that the provided baselines fail to exceed random guessing, would falsify the claim that the dataset enables meaningful fine-grained classification.
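
To make "random guessing" concrete, the reference level can be computed from class counts. The sketch below assumes the 10,000 images are spread evenly over the 100 models, which the text above does not state; the function name `chance_accuracy` and the label names are hypothetical.

```python
def chance_accuracy(class_counts):
    """Return (uniform-guess accuracy, majority-class accuracy).

    class_counts maps each class label to its number of test images.
    """
    n_classes = len(class_counts)
    total = sum(class_counts.values())
    uniform = 1.0 / n_classes                       # guess a class uniformly at random
    majority = max(class_counts.values()) / total   # always predict the largest class
    return uniform, majority


# Assumption: 100 models with 100 images each (balanced classes).
counts = {f"model_{i:03d}": 100 for i in range(100)}
uniform, majority = chance_accuracy(counts)
print(uniform, majority)
```

Under that balance assumption both reference levels are 1%, so any baseline meaningfully above that figure already rules out the second falsification condition.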

Original abstract

This paper introduces FGVC-Aircraft, a new dataset containing 10,000 images of aircraft spanning 100 aircraft models, organised in a three-level hierarchy. At the finer level, differences between models are often subtle but always visually measurable, making visual recognition challenging but possible. A benchmark is obtained by defining corresponding classification tasks and evaluation protocols, and baseline results are presented. The construction of this dataset was made possible by the work of aircraft enthusiasts, a strategy that can extend to the study of a number of other object classes. Compared to the domains usually considered in fine-grained visual classification (FGVC), for example animals, aircraft are rigid and hence less deformable. They, however, present other interesting modes of variation, including purpose, size, designation, structure, historical style, and branding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FGVC-Aircraft, a new dataset of 10,000 images spanning 100 aircraft models organized in a three-level hierarchy. It defines corresponding classification tasks and evaluation protocols at different hierarchy levels and presents baseline results obtained with standard methods. The construction relies on contributions from aircraft enthusiasts, and the paper notes that aircraft are rigid objects presenting modes of variation such as purpose, size, designation, structure, historical style, and branding.

Significance. If the labels and splits are reliable, the dataset supplies a useful benchmark for fine-grained visual classification on rigid objects whose inter-class differences are often subtle. The three-level hierarchy supports multi-granularity experiments, and the enthusiast-sourcing approach offers a scalable template for other domains. Baseline numbers establish an initial reference point for future method comparisons.

major comments (2)

[Dataset construction and annotation] The abstract asserts that 'differences between models are often subtle but always visually measurable' and that the hierarchy is useful, yet the manuscript provides no dedicated section or table quantifying inter-annotator agreement, label verification procedure, or the fraction of model pairs whose visual separability was explicitly checked. This verification step is load-bearing for the claim that the 100-class task is 'challenging but possible.'
[Tasks, protocols, and baselines] The evaluation protocols are described at a high level, but the paper does not report the exact train/validation/test splits per hierarchy level or the number of images per model. Without these numbers (or a supplementary table), it is difficult to reproduce the baselines or assess class balance.
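
One common way to quantify the inter-annotator agreement requested in the first major comment is Cohen's kappa for two annotators. The sketch below is illustrative only: nothing in the text indicates the authors used this statistic, and the label values are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on four images.
annotator_1 = ["variant_x", "variant_x", "variant_y", "variant_y"]
annotator_2 = ["variant_x", "variant_x", "variant_y", "variant_x"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four images (0.75) against a chance level of 0.5, giving a kappa of 0.5. With more than two annotators a multi-rater statistic such as Fleiss' kappa would be the usual substitute.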

minor comments (2)

[Figures] Figure 1 (example images) would benefit from captions that explicitly indicate the three hierarchy levels for each shown aircraft.
[Introduction] The related-work discussion could cite the exact prior FGVC datasets (e.g., CUB-200-2011) when contrasting deformable vs. rigid object challenges.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the FGVC-Aircraft dataset as a benchmark for fine-grained classification of rigid objects. We address the major comments point by point below and will revise the manuscript accordingly to improve clarity and reproducibility.

Referee: [Dataset construction and annotation] The abstract asserts that 'differences between models are often subtle but always visually measurable' and that the hierarchy is useful, yet the manuscript provides no dedicated section or table quantifying inter-annotator agreement, label verification procedure, or the fraction of model pairs whose visual separability was explicitly checked. This verification step is load-bearing for the claim that the 100-class task is 'challenging but possible.'

Authors: We acknowledge that the manuscript would benefit from greater transparency on the annotation process. The dataset was constructed through contributions by aircraft enthusiasts possessing domain expertise, which guided the selection of 100 models where inter-model differences are visually measurable (as asserted in the abstract). However, we did not include a dedicated section quantifying inter-annotator agreement or explicit pairwise separability checks. In the revised version, we will add a section on dataset construction that describes the label collection and verification procedures employed, thereby supporting the claim that the 100-class task is challenging but possible.

revision: yes

Referee: [Tasks, protocols, and baselines] The evaluation protocols are described at a high level, but the paper does not report the exact train/validation/test splits per hierarchy level or the number of images per model. Without these numbers (or a supplementary table), it is difficult to reproduce the baselines or assess class balance.

Authors: We agree that the exact splits and per-model image counts are necessary for full reproducibility and class-balance assessment. While the manuscript states the overall dataset size (10,000 images across 100 models) and describes the evaluation protocols at a high level, it does not tabulate the precise train/validation/test splits per hierarchy level or the image counts per model. We will add a supplementary table (or expanded section) providing these details in the revised manuscript.

revision: yes
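
The per-model, per-split table discussed in this exchange could be produced with a few lines of code. The sketch below assumes annotations are available as `(image_id, label, split)` tuples; that layout, the function names, and the sample values are hypothetical and do not describe the dataset's actual file format.

```python
from collections import Counter


def split_class_counts(samples):
    """Build {label: Counter({split: n_images})} from (image_id, label, split) tuples."""
    table = {}
    for _image_id, label, split in samples:
        table.setdefault(label, Counter())[split] += 1
    return table


def imbalance_ratio(table, split):
    """Largest class count divided by smallest class count within one split."""
    counts = [row[split] for row in table.values()]
    smallest = min(counts)
    return max(counts) / smallest if smallest > 0 else float("inf")


# Hypothetical annotations, for illustration only.
samples = [
    ("img1", "model_a", "train"),
    ("img2", "model_a", "train"),
    ("img3", "model_a", "test"),
    ("img4", "model_b", "train"),
    ("img5", "model_b", "test"),
]
table = split_class_counts(samples)
train_ratio = imbalance_ratio(table, "train")
test_ratio = imbalance_ratio(table, "test")
print(dict(table), train_ratio, test_ratio)
```

A ratio of 1.0 means the split is perfectly balanced across classes; larger values flag the imbalance the referee wants to be able to check.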

Circularity Check

0 steps flagged

No significant circularity

The paper is a dataset introduction paper whose central contribution is the release of FGVC-Aircraft (10k images, 100 models, three-level hierarchy) together with task definitions and baselines. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. Claims about subtle but visually measurable differences and hierarchy usefulness are stated as descriptive properties of the collected data rather than derived results. The enthusiast-sourcing strategy is presented only as an extensible construction method, not as a self-referential proof. No self-citations or ansatzes are invoked to support load-bearing steps, so the derivation chain (such as it is) is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the domain assumption that aircraft images can be hierarchically organized and that visual differences are measurable; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Differences between aircraft models are subtle but always visually measurable.
Stated in the abstract as the basis for the classification challenge.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear

Passage: "This paper introduces FGVC-Aircraft, a new dataset containing 10,000 images of aircraft spanning 100 aircraft models, organised in a three-level hierarchy. At the finer level, differences between models are often subtle but always visually measurable."

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

Passage: "A benchmark is obtained by defining corresponding classification tasks and evaluation protocols, and baseline results are presented."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
cs.CV 2026-05 unverdicted novelty 7.0

SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
cs.CV 2026-05 conditional novelty 7.0

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.
Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering
cs.CV 2026-05 conditional novelty 7.0

GSEC uses MLLM-generated semantic guidance and bi-layer ensemble learning to reduce bias and variance, outperforming 18 prior methods on six image clustering benchmarks.
Online Continual Learning with Dynamic Label Hierarchies
cs.LG 2026-05 unverdicted novelty 7.0

HALO improves online continual learning under evolving label hierarchies by adaptively combining classification heads regularized with organized learnable prototypes for better adaptation and reduced forgetting.
MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching
cs.CV 2026-05 unverdicted novelty 7.0

MC-RFM achieves superior few-shot adaptation by representing features on a mixed hyperbolic-Euclidean manifold and learning task-conditioned continuous transport via Riemannian flow matching to hybrid prototypes.
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data
cs.LG 2026-05 unverdicted novelty 7.0

SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships ...
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG 2026-05 unverdicted novelty 7.0

HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.
Hierarchically Robust Zero-shot Vision-language Models
cs.CV 2026-04 unverdicted novelty 7.0

A hierarchical adversarial fine-tuning method for VLMs aligns image and text embeddings at multiple hierarchy depths with theoretical margin connections to boost robustness to leaf and superclass attacks while using m...
Improving Sparse Autoencoder with Dynamic Attention
cs.LG 2026-04 unverdicted novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
cs.CV 2026-04 unverdicted novelty 7.0

CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.
CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
cs.CR 2026-04 unverdicted novelty 7.0

CLIP-Inspector reconstructs OOD triggers to detect backdoors in prompt-tuned CLIP models with 94% accuracy and higher AUROC than baselines, plus a repair step via fine-tuning.
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
cs.CV 2026-04 conditional novelty 7.0

FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
cs.SD 2026-05 conditional novelty 6.0

In moderate-sized fine-grained bioacoustics, pretraining scale of masked autoencoders on diverse general audio dominates over domain-specific objectives or data curation for transfer performance.
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
cs.CV 2026-05 unverdicted novelty 6.0

A3B2 adds uncertainty-aware dampening and asymmetric MoE-style adapters to balance image and text branches, outperforming 11 baselines on 11 few-shot datasets.
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.
Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery
cs.CV 2026-05 unverdicted novelty 6.0

Relational Pattern Consistency improves generalized category discovery by using invariant relational patterns between novel samples and known-class prototypes for bidirectional knowledge transfer.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG 2026-05 unverdicted novelty 6.0

HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
ModelLens: Finding the Best for Your Task from Myriads of Models
cs.LG 2026-05 unverdicted novelty 6.0

ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
cs.CV 2026-05 unverdicted novelty 6.0

DINORANKCLIP outperforms CLIP and RANKCLIP on fine-grained and out-of-distribution tasks by injecting DINOv3 local structure and using third-order ranking consistency trained on Conceptual Captions 3M.
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improve...
SpecPL: Disentangling Spectral Granularity for Prompt Learning
cs.CV 2026-05 unverdicted novelty 6.0

SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
cs.CV 2026-05 unverdicted novelty 6.0

IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
cs.AI 2026-05 unverdicted novelty 6.0

JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Three frameworks adapt foundation models for generalized category discovery under domain shifts via disentanglement and prompt tuning, showing gains on synthetic and real multi-domain data.
Prototype-Based Test-Time Adaptation of Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

PTA adapts VLMs at test time by maintaining and updating class-specific knowledge prototypes from test samples, achieving higher accuracy than cache-based methods with far less speed loss.
Prototype-Based Test-Time Adaptation of Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

PTA adapts VLMs at test time via adaptively weighted class prototypes that accumulate test-sample features, delivering higher accuracy than cache-based TTA while preserving nearly full inference speed.
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
cs.CV 2026-04 unverdicted novelty 6.0

HyCal mitigates Domain Gravity in cross-discipline imbalanced few-shot class-incremental learning by calibrating prototypes with complementary directional and covariance-aware distances on frozen CLIP embeddings.
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
cs.CV 2026-04 unverdicted novelty 6.0

Dual-modality anchors from text descriptions and test-time image statistics filter views and ensemble predictions to improve test-time prompt tuning, achieving SOTA on 15 datasets.
Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
cs.CV 2026-04 unverdicted novelty 6.0

AdvFLYP finetunes CLIP on web image-text pairs using adversarial contrastive learning and regularization to boost zero-shot adversarial robustness across domains better than prior proxy-dataset methods.
Visual prompting reimagined: The power of the Activation Prompts
cs.CV 2026-04 unverdicted novelty 6.0

Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.
Rényi Attention Entropy for Patch Pruning
cs.CV 2026-04 unverdicted novelty 6.0

Rényi entropy of attention maps serves as a tunable criterion for pruning redundant patches in vision transformers, reducing compute with preserved accuracy on image recognition.
The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
cs.LG 2026-03 unverdicted novelty 6.0

EAGC mitigates gradient entanglement in GCD by anchoring supervised gradients and adaptively projecting unlabeled ones, boosting existing methods to new state-of-the-art performance.
Perception Encoder: The best visual embeddings are not at the output of the network
cs.CV 2025-04 unverdicted novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
Visual-RFT: Visual Reinforcement Fine-Tuning
cs.CV 2025-03 conditional novelty 6.0

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
EVA-CLIP: Improved Training Techniques for CLIP at Scale
cs.CV 2023-03 conditional novelty 6.0

EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
Information theoretic underpinning of self-supervised learning by clustering
cs.LG 2026-05 unverdicted novelty 5.0

SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
cs.CV 2026-05 unverdicted novelty 5.0

LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.
CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning
cs.LG 2026-05 unverdicted novelty 5.0

CERSA derives low-rank fine-tuning subspaces from SVD principal components that retain 90-95% spectral energy, delivering higher performance than LoRA and other PEFT baselines at substantially lower memory cost across...
Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
cs.CV 2026-05 unverdicted novelty 5.0

Fine-tuning impairs the class balance of foundation models in long-tailed personalized federated learning, which FedPuReL addresses through gradient purification using zero-shot predictions and residual-based personal...
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
cs.CV 2026-05 unverdicted novelty 5.0

Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
Leveraging Vision-Language Models as Weak Annotators in Active Learning
cs.CV 2026-05 unverdicted novelty 5.0

An active learning method combines VLM coarse weak labels with limited human fine labels via instance-wise assignment and noise modeling to outperform prior methods on CUB200 and FGVC-Aircraft under fixed annotation budgets.
Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification
cs.CV 2026-04 unverdicted novelty 5.0

FEDSNet improves few-shot fine-grained image classification by fusing spatial texture and frequency-based structural subspaces to reduce noise overfitting.
Hierarchical Textual Knowledge for Enhanced Image Clustering
cs.CV 2026-04 unverdicted novelty 5.0

KEC constructs hierarchical textual knowledge from LLMs to create knowledge-enhanced image features that improve clustering performance over baselines and zero-shot CLIP on 20 datasets.
Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels
cs.CV 2026-04 unverdicted novelty 5.0

HopS selects robust labels for partial-label prompt learning via local density filtering and global optimal transport, improving performance over baselines on eight datasets.
Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics
cs.CV 2026-04 unverdicted novelty 5.0

GAPL anchors text prompts to second-order Gram matrix statistics to improve vision-language model adaptation across domains.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · cited by 47 Pith papers

[1] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In Proc. BMVC, 2011.

[2] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-Grained Visual Categorization, 2011.

[3] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur. Dog breed classification using part localization. In Proc. ECCV, 2012.

[4] O. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Cats vs dogs. In Proc. CVPR, 2012.

[5] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.