pith. machine review for the scientific record.

arxiv: 2304.07193 · v2 · submitted 2023-04-14 · 💻 cs.CV

Recognition: 3 theorem links

DINOv2: Learning Robust Visual Features without Supervision

Alaaeldin El-Nouby, Armand Joulin, Daniel Haziza, Francisco Massa, Gabriel Synnaeve, Hervé Jégou, Hu Xu, Huy Vo, Ishan Misra, Julien Mairal, Mahmoud Assran, Marc Szafraniec, Maxime Oquab, Michael Rabbat, Nicolas Ballas, Patrick Labatut, Pierre Fernandez, Piotr Bojanowski, Po-Yao Huang, Russell Howes, Shang-Wen Li, Théo Moutakanni, Timothée Darcet, Vasil Khalidov, Vasu Sharma, Wojciech Galuba

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 01:50 UTC · model claude-opus-4-7

classification 💻 cs.CV
keywords self-supervised learning

0 comments

The pith

Curated self-supervised pretraining on 142M images yields frozen visual features that match or beat OpenCLIP across image and pixel tasks without finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that self-supervised learning can produce general-purpose, frozen visual features competitive with text-image models, provided the pretraining data is large, diverse, and curated rather than scraped raw. The authors build LVD-142M by retrieving visual neighbours of curated seed datasets from a 1.2B-image web pool, with no captions or metadata involved. They then combine image-level and patch-level discriminative objectives with feature-spread regularisation, train a 1.1B-parameter ViT, and distill it into smaller models. Across classification, retrieval, segmentation, depth estimation, and video tasks, the frozen features beat prior self-supervised baselines by large margins and reach or exceed OpenCLIP-G with linear probes, while a finetuning sanity check adds only about two percentage points on ImageNet. The implicit message is that for vision features alone, language supervision is a convenience rather than a necessity.

Core claim

The authors argue that the gap between self-supervised image features and text-image (weakly supervised) features is not a property of self-supervision itself but a consequence of training on either small curated sets or large but uncurated web dumps. Given (i) a 142M-image dataset assembled by visual-similarity retrieval against curated seeds, (ii) a stabilised combination of image-level (DINO), patch-level (iBOT), Sinkhorn-Knopp centering and a KoLeo feature-spread regulariser, and (iii) a 1.1B-parameter ViT-g distilled into smaller ViTs, the resulting frozen features match or exceed OpenCLIP on classification, retrieval, segmentation, depth, and video tasks without any finetuning.

What carries the argument

A discriminative self-supervised recipe combining a DINO image-level cross-entropy loss on class tokens, an iBOT masked patch-level loss with untied heads, Sinkhorn-Knopp teacher centering, and a KoLeo nearest-neighbour entropy regulariser that spreads features uniformly; trained on LVD-142M, a dataset built by deduplicating 1.2B web images and retrieving visual neighbours of curated seed datasets without any text or metadata. Smaller models are produced by self-distillation from the ViT-g teacher rather than retrained from scratch.
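
The two least standard ingredients named above are the KoLeo spread regulariser and the Sinkhorn-Knopp teacher centering. Below is a minimal PyTorch sketch of both, assuming L2-normalised class tokens and SwAV-style prototype scores; the 0.1 KoLeo weight matches the value the paper reports in its appendix, while batch shapes, temperatures, and iteration counts are illustrative rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def koleo_loss(cls_tokens: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Kozachenko-Leonenko entropy regulariser: pushes every class token away
    from its nearest neighbour in the batch, spreading features uniformly.
    Computed per GPU, with no cross-GPU gather (cf. the paper's Appendix B.1)."""
    x = F.normalize(cls_tokens, dim=-1)
    sim = x @ x.t()                                  # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                         # exclude self-matches
    nn_sim = sim.max(dim=1).values                   # nearest-neighbour similarity
    nn_dist = torch.sqrt((2.0 - 2.0 * nn_sim).clamp_min(eps))
    return -torch.log(nn_dist).mean()

@torch.no_grad()
def sinkhorn_knopp(teacher_scores: torch.Tensor, n_iter: int = 3) -> torch.Tensor:
    """SwAV-style Sinkhorn-Knopp centering of teacher outputs: alternately
    normalise over prototypes and over the batch so no prototype collapses.
    `teacher_scores` is (batch, prototypes), already divided by temperature."""
    Q = torch.exp(teacher_scores).t()                # (prototypes, batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iter):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K      # prototypes used equally
        Q /= Q.sum(dim=0, keepdim=True); Q /= B      # each sample sums to 1/B
    return (Q * B).t()                               # per-sample assignment targets

# Illustrative composition; the DINO/iBOT weighting is not reported in the paper:
# loss = dino_term + ibot_term + 0.1 * koleo_loss(student_cls)
```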

If this is right

  • Frozen vision backbones can serve as drop-in feature extractors for downstream pipelines (segmentation, depth, retrieval) without per-task finetuning, simplifying deployment, as sketched below.
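
If that holds, deployment reduces to the sketch below: load a released checkpoint, freeze it, and train only a light head on top. The torch.hub entry-point name is taken from the facebookresearch/dinov2 release but should be treated as an assumption and verified against the repository; the linear head is illustrative.

```python
import torch

# Load a released distilled backbone and freeze it (hub name assumed from
# the facebookresearch/dinov2 release; verify before relying on it).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval().requires_grad_(False)             # frozen: no finetuning

images = torch.randn(8, 3, 224, 224)              # sides must be multiples of 14
with torch.no_grad():
    feats = backbone(images)                      # class-token features per image

# Per-task adaptation trains only a light head on the frozen features.
head = torch.nn.Linear(feats.shape[-1], 1000)     # e.g. a linear classifier
logits = head(feats)
```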

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The retrieval-based curation step likely amplifies whatever biases exist in the seed datasets (ImageNet-22k, Google Landmarks, fine-grained sets), which would explain the residual Western/high-income skew the authors report on Dollar Street and predict where the features will quietly underperform.

Load-bearing premise

That visual-similarity retrieval against curated seed datasets produces a pretraining set diverse enough to generalise broadly, rather than one that quietly overfits to whatever the seed datasets already represent.

What would settle it

Evaluate the released frozen features with a linear probe on a benchmark whose visual domain is deliberately absent from the curated seeds (e.g. medical imagery, satellite imagery, or strongly non-Western Dollar Street images) and check whether they still match OpenCLIP-G; the paper itself reports a 25.7% Africa-vs-Europe gap on Dollar Street, so a wider domain audit could either confirm or break the "general-purpose" claim.
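
Concretely, the audit amounts to fitting identical linear probes on frozen features from both backbones over the seed-disjoint benchmark and comparing accuracies. A minimal harness sketch; the feature arrays are assumed to be pre-extracted and cached, and the scikit-learn probe stands in for whatever probe protocol the comparison standardises on.

```python
from sklearn.linear_model import LogisticRegression

def probe_accuracy(train_feats, train_labels, test_feats, test_labels) -> float:
    # Identical linear probe for either backbone's frozen features.
    clf = LogisticRegression(max_iter=2000, n_jobs=-1)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# gap > 0 on a seed-disjoint domain would support the general-purpose claim:
# gap = probe_accuracy(dino_tr, y_tr, dino_te, y_te) \
#     - probe_accuracy(clip_tr, y_tr, clip_te, y_te)
```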

read the original abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 9 minor

Summary. The manuscript presents DINOv2, a family of self-supervised ViT models (S/B/L/g, up to 1.1B params) pretrained on a new automatically curated 142M-image dataset (LVD-142M) using a combined DINO+iBOT objective with several stabilization/efficiency improvements (KoLeo regularizer, Sinkhorn-Knopp centering, untied heads, SwiGLU FFN, FSDP mixed precision, sequence packing, efficient stochastic depth, FlashAttention-style kernels, short high-resolution adaptation phase, distillation of smaller models from ViT-g). The central empirical claim is that the resulting frozen features match or surpass the best openly available weakly-supervised features (OpenCLIP, EVA-CLIP, SWAG) across a broad suite of image- and pixel-level benchmarks (ImageNet linear/kNN/finetune, robustness sets, fine-grained classification, instance retrieval, video, semantic segmentation on ADE20k/Cityscapes/VOC, monocular depth on NYUd/KITTI/SUN-RGBD), and substantially exceed prior SSL baselines. The paper also reports ablations of training recipe, data source, model/data scaling, loss components, distillation, and resolution adaptation, plus fairness analyses on Dollar Street and Casual Conversations.

Significance. If the empirical claims hold, this is a substantial and useful contribution: it is the first SSL pipeline whose frozen features are broadly competitive with text-supervised models of comparable scale, on tasks ranging from linear ImageNet probing to dense prediction. The technical work is practical and well-documented — the engineering contributions (FSDP+mixed precision, sequence packing, efficient stochastic depth, the unsupervised retrieval-based curation pipeline) are independently useful, and the authors release code, pretrained checkpoints, and a carbon-cost accounting. The ablation in Tables 1, 2, 3a, 3b, and Fig. 4 separates the contributions of recipe, data, scale, and loss in a credible way. The qualitative emergent behaviors (PCA part-correspondence in Figs. 1/9, cross-domain depth/segmentation transfer in Fig. 8, patch matching in Fig. 10) are notable. The frozen-backbone Mask2Former+ViT-Adapter result on ADE20k (60.2 mIoU, §7.4) is a strong signal that the features are not over-specialized to linear probing.

major comments (4)
  1. [§3, Table 15, Table 18] Curation–evaluation asymmetry in the OpenCLIP comparison. LVD-142M is built by retrieving uncurated images close to a query set that explicitly includes the train splits of many evaluation benchmarks (Food-101, SUN397, Cars, Aircraft, VOC, DTD, Pets, Caltech101, Flowers, CUB, ADE20k, Cityscapes, VOC-Seg, NYU-Depth, KITTI, SUN-RGBD, R-Oxford, R-Paris, Met, AmsterTime, GLDv2, ImageNet-1k/22k). The deduplication step (§3) removes near-duplicates of *test/val* splits but not images visually close to *train* splits — that is in fact the design goal. Headline comparisons against OpenCLIP/EVA-CLIP (Tables 4, 8, 9, 10, 11) therefore confound 'better method' with 'pretraining distribution targeted at the evaluation manifolds.' The paper acknowledges curation matters (Table 2) but does not isolate the targeted-retrieval effect from generic curation. A controlled experiment is needed: e.g., re-training at the same 142M scale with retrieval queries restricted to a single broad source (ImageNet-22k only); the retrieval step at issue is sketched after this list.
  2. [§7.1, Table 4; §7.2 Tables 7–8] Partial mitigations are reported but not consolidated. The strongest evidence that gains are not purely curation-driven is on benchmarks that were *not* retrieval queries: iNaturalist-2018/2021 and Places205 (Table 7), video (K400/UCF/SSv2, Table 7), and ImageNet-A/R/Sketch (Table 6). These results should be foregrounded as the primary 'general-purpose' claim, with the retrieval-targeted benchmarks (fine-grained, segmentation, retrieval) reported separately as in-distribution-by-design. As written, Tables 4/8–11 mix the two regimes, and the abstract's 'surpass OpenCLIP on most benchmarks' aggregates them.
  3. [§6.2, Table 2] The data-source ablation is informative but does not separate 'curation' from 'targeted curation.' The 'Uncurated data' baseline samples 142M images from the same web pool, while LVD-142M is the retrieval-shaped variant; INet-22k is a third condition. A missing fourth condition — curated/deduplicated 142M with retrieval queries restricted to a single broad source (e.g., INet-22k only) — would let the reader attribute gains to the multi-dataset retrieval-target design vs. generic curation. This is the same controlled experiment requested in major comment 1, and would also clarify the model-scale × data interaction in Fig. 4.
  4. [§5 (Distillation) and §6.5, Fig. 5] The distillation procedure is described loosely. §5 states 'we use a larger model as a frozen teacher, keep a spare EMA of the student … remove the masking and stochastic depth, and apply the iBOT loss on the two global crops.' It is not clear (i) whether the distilled students see LVD-142M or a different mix, (ii) what the loss weighting is between DINO and iBOT terms in this regime, and (iii) whether the 'spare EMA' student is the released checkpoint for ViT-S/B/L. Since distilled models are the ones most users will deploy, and Fig. 5 shows the distilled ViT-L sometimes exceeding its ViT-g teacher (e.g., Oxford-H), the procedure deserves a precise specification (pseudocode or formal loss) and a sanity check against accidental teacher leakage of evaluation-targeted information.
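
For readers weighing comment 1, the retrieval step at issue reduces to nearest-neighbour search from curated seed images into the deduplicated web pool. A hedged sketch using faiss; whether this mirrors the authors' exact pipeline, index type, normalisation, or value of k is an assumption.

```python
import faiss
import numpy as np

def curate(pool_emb: np.ndarray, seed_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of pool images retrieved as visual neighbours of seeds.
    Embeddings are float32; cosine similarity via normalised inner product."""
    faiss.normalize_L2(pool_emb)
    faiss.normalize_L2(seed_emb)
    index = faiss.IndexFlatIP(pool_emb.shape[1])
    index.add(pool_emb)
    _, idx = index.search(seed_emb, k)            # k nearest pool images per seed
    return np.unique(idx.ravel())                 # the retrieved pretraining subset
```

The referee's requested control is then a one-line change: restrict `seed_emb` to ImageNet-22k embeddings only and rebuild the 142M set.
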
minor comments (9)
  1. [Abstract / §1] The abstract claims models 'surpass OpenCLIP on most of the benchmarks at image and pixel levels.' Given Table 8 (SUN −5.3, Cars −4.7), Table 7 Places205 (−2.3), and Table 6 (Im-R, Sketch worse than OpenCLIP-G), 'most' is defensible but should be qualified; consider 'on a majority of benchmarks, with notable exceptions on scene/web-text-heavy classification.'
  2. [Table 1] Some rows ('+SwiGLU FFN', '+Patch size 14', '+Sinkhorn-Knopp') show null or negative deltas on linear probing. The narrative ('each component improves … in most cases') is fine, but it would help to mark which entries are kept for downstream/training-stability reasons rather than accuracy.
  3. [§4, KoLeo regularizer] Appendix B.1 notes the KoLeo regularizer is computed only on class tokens within a single GPU without cross-GPU communication. This is a meaningful implementation detail (regularizer strength scales with per-GPU batch) and should be mentioned in the main text where the loss is introduced.
  4. [§6.6, Fig. 6] The high-resolution adaptation ablation is run only on ViT-L/16 trained on ImageNet-1k, not on the LVD-142M setup actually shipped. A short note on transferability of the conclusion to the production setup would strengthen the section.
  5. [§7.4, Table 10] The state-of-the-art reference numbers in parentheses (62.9, 86.9, 89.0) and the Mask2Former+ViT-Adapter 60.2 number on ADE20k come from heterogeneous training regimes. A footnote summarizing decoder, training data, and crop size for each comparator would aid interpretation.
  6. [§8.1, Table 12] The geographic-fairness gap (Africa vs. Europe, low vs. high income) is reported but not connected back to LVD-142M's composition. A brief discussion of which retrieval-query datasets are likely Western-centric would make this section more actionable.
  7. [Fig. 2 / Fig. 4] On several panels (e.g., 'Monocular Depth'), lower y-axis values are better (R-MSE), inverting the usual higher-is-better reading. A small annotation ('↓ better') would prevent misreading.
  8. [§5, ViT-g architecture] Embedding dim 1536 / 24 heads vs. Zhai et al.'s 1408 / 16 heads is documented, but the parameter count of 1.1B should be tabulated against the original ViT-g for clarity (Table 17 has shape but not total params).
  9. [Appendix B.1] Table 16 gives most pretraining hyperparameters but omits the iBOT/DINO loss weight ratio and the masking ratio used in iBOT. These are needed for reproduction.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for a careful and constructive report. The three major comments converge on a single legitimate concern — that LVD-142M's retrieval-based construction uses many evaluation train splits as queries, so headline comparisons to OpenCLIP/EVA-CLIP partially confound method quality with pretraining-distribution targeting — together with two specific requests (a controlled curation ablation and a precise specification of the distillation procedure). We accept all three points and propose concrete revisions: (i) a new controlled condition in Table 2 that retrieves from a single broad query source (ImageNet-22k only) at the same 142M scale, isolating multi-target retrieval from generic curation; (ii) a restructuring of §7 and the abstract that foregrounds genuinely held-out benchmarks (iNaturalist, Places205, video, ImageNet robustness) as the primary general-purpose evidence and annotates each table with a 'retrieval-query / held-out' tag; (iii) a tightened §5 with formal loss and pseudocode for the distillation procedure, explicit statements that distilled students train on LVD-142M and that the released checkpoint is the student EMA. We note that test/val deduplication against every evaluation benchmark is already performed (§3, Appendix A.3), which addresses strict leakage; the referee's concern is the softer distribution-shaping effect, and the revisions above address it directly. We believe these changes preserve the paper's contributions while making the empirical claims precisely scoped.

read point-by-point responses
  1. Referee: Curation–evaluation asymmetry: LVD-142M retrieval queries include train splits of many evaluation benchmarks; comparisons to OpenCLIP/EVA-CLIP confound method vs. distribution-targeting. A controlled experiment isolating targeted retrieval from generic curation is requested.

    Authors: We agree this is a legitimate concern and one we should make more explicit in the manuscript. We offer two clarifications and propose one new experiment: (i) Test/val deduplication. As stated in §3 and Appendix A.3, we remove near-duplicates of *both* test and validation splits of every benchmark used in the paper, using the SSCD copy-detection embeddings of Pizzi et al. (2022) at similarity >0.45 over the full image (not just train splits). The retrieval queries are train images, but no near-duplicate of any evaluation image enters LVD-142M. This rules out leakage in the strict sense; the referee's concern is the softer 'distribution-shaping' effect, which is real and which we did not cleanly isolate. (ii) The referee is correct that Table 2's 'Uncurated' baseline conflates curation with targeted curation. We will run, and report in the revision, the requested controlled condition: an LVD-142M-sized dataset whose retrieval queries are restricted to ImageNet-22k only (no fine-grained / segmentation / retrieval / depth query datasets). This isolates the contribution of multi-source targeted retrieval from generic SSL-style curation against a single broad source. (iii) We will rewrite the abstract and §7 framing to separate retrieval-targeted benchmarks from genuinely held-out ones (see response to next comment), so the 'surpasses OpenCLIP on most benchmarks' claim is qualified by which benchmarks are in-distribution-by-design. revision: yes

  2. Referee: Foreground the held-out benchmarks (iNaturalist 2018/2021, Places205, video K400/UCF/SSv2, ImageNet-A/R/Sketch) as the primary general-purpose claim; report retrieval-targeted benchmarks separately.

    Authors: We accept this restructuring. iNaturalist-2018, iNaturalist-2021, Places205, the three video benchmarks (K400, UCF-101, SSv2), and ImageNet-{A,R,C,Sketch} were not retrieval query sources (cf. Table 18) and constitute the cleanest evidence for general-purpose features. In the revised §7 we will (a) introduce a 'held-out evaluation' subsection that aggregates these results before any retrieval-targeted benchmarks, (b) annotate Tables 4, 7, 8, 9, 10, 11 with a column indicating whether the benchmark's train split was used as a retrieval query, and (c) revise the abstract to read along the lines of 'on a suite of held-out benchmarks (iNaturalist, Places205, video, ImageNet robustness sets) our frozen features match or surpass OpenCLIP, and on benchmarks whose training data was used as retrieval queries the gains are larger but partly attributable to targeted curation.' We believe the held-out evidence alone (e.g., +8.6/+9.7% on iNat-2018/2021 vs. OpenCLIP-G in Table 7, +12.1% on ImageNet-A vs. OpenCLIP-G in Table 6, +2.5% on SSv2) supports the general-purpose claim, and stating this cleanly will strengthen the paper. revision: yes

  3. Referee: Distillation procedure underspecified: training data, DINO/iBOT loss weighting, identity of the released checkpoint (EMA vs. student), and risk of teacher leakage of evaluation-targeted information.

    Authors: We will tighten §5 and add a pseudocode block plus an explicit loss specification in Appendix B. To answer the three sub-questions directly: (i) Data. Distilled students (ViT-S/B/L distilled from ViT-g) are trained on LVD-142M, the same data as the ViT-g teacher. We will state this explicitly and note the implication for the leakage concern: the teacher's exposure to retrieval-shaped data is inherited by the students, so the curation–evaluation asymmetry discussed above applies equally to the released small models. (ii) Loss. Distillation reuses the same DINO + iBOT + KoLeo objective as pretraining, with identical loss weights; the only differences are (a) the teacher is the frozen ViT-g rather than an EMA of the student, (b) masking and stochastic depth are disabled on the student, and (c) iBOT is computed on both global crops (rather than on masked patches). There is no separate distillation loss term and no temperature/weight tuning beyond the pretraining recipe. We will write this out as a formal expression. (iii) Released checkpoint. The released ViT-S/B/L distilled checkpoints are the *EMA of the student*, not the student itself, matching the pretraining convention. We will state this in §5 and in the model card. On leakage: the teacher was not trained with any supervised signal from evaluation labels, so 'teacher leakage of evaluation-targeted information' is bounded by what the teacher itself learned from LVD-142M, i.e., the same caveat as in major comment 1. The controlled-curation experiment we add will therefore also constrain this concern for the distilled models. revision: yes
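
To make response 3 concrete ahead of the promised revision, here is a hedged reconstruction of one distillation step as described: frozen ViT-g teacher, the same DINO + iBOT + KoLeo objective, masking and stochastic depth disabled, iBOT on both global crops, and an EMA of the student kept as the released checkpoint. All helper names and the `.cls`/`.patches` output structure are illustrative; only the 0.1 KoLeo weight is taken from the paper.

```python
import torch

def distill_step(teacher, student, student_ema, g1, g2, opt, momentum=0.998):
    """One distillation step; dino_loss / ibot_loss / koleo_loss are
    hypothetical helpers for the respective pretraining loss terms."""
    with torch.no_grad():                          # frozen ViT-g, not an EMA teacher
        t1, t2 = teacher(g1), teacher(g2)
    s1, s2 = student(g1), student(g2)              # masking, stochastic depth off

    loss = (dino_loss(t2.cls, s1.cls) + dino_loss(t1.cls, s2.cls)
            + ibot_loss(t1.patches, s1.patches)    # iBOT applied to the two
            + ibot_loss(t2.patches, s2.patches)    # global crops, unmasked
            + 0.1 * koleo_loss(s1.cls))            # paper's reported KoLeo weight
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                          # released ViT-S/B/L = student EMA
        for pe, p in zip(student_ema.parameters(), student.parameters()):
            pe.mul_(momentum).add_(p, alpha=1.0 - momentum)
```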

standing simulated objections (unresolved)
  • Even with the requested ImageNet-22k-only retrieval ablation, we cannot fully decouple targeted curation from method quality on the fine-grained, segmentation, retrieval, and depth benchmarks whose train sets were used as queries: those evaluations remain in-distribution-by-design relative to LVD-142M, and the revised text will say so rather than claim a clean separation.

Circularity Check

1 step flagged

No formal circularity in the derivation; the only structural concern is benchmark-aware data curation, which the paper partially neutralizes with held-out evaluations.

specific steps
  1. other [§3 Data Processing; Appendix Table 15]
    "We assemble our curated LVD-142M dataset by retrieving, from a large pool of uncurated data, images that are close to those in several curated datasets. ... Our selection of curated datasets ... contains ImageNet-22k, the train split of ImageNet-1k, Google Landmarks and several fine-grained datasets."

    Not circular in the formal sense (no quantity is fit and then re-reported as a prediction), but pretraining mass is shaped toward the train splits of many eval datasets (Table 15: Food-101, SUN397, Cars, Aircraft, VOC, DTD, Pets, Caltech101, Flowers, CUB, ADE20k, Cityscapes, NYU, KITTI, R-Ox/Paris, Met, AmsterTime, GLDv2). Test-set duplicates are removed but train-distribution shaping is not. This makes the OpenCLIP head-to-head asymmetric. Held-out non-query evaluations (iNat, Places205, video, IN-A/R/Sk) still show wins, so independent support remains — hence low score, not high.

full rationale

DINOv2 is an empirical paper. There is no analytical "derivation chain" of the kind that admits self-definitional circularity (no fitted parameter is renamed as a prediction, no uniqueness theorem is imported from the authors to forbid alternatives, no ansatz is smuggled via self-citation). The training objective (DINO + iBOT + KoLeo + Sinkhorn-Knopp) is assembled from prior published work, and the headline claims ("matches/exceeds OpenCLIP on a broad benchmark suite") are evaluated against external models the authors do not control. That is the opposite of a circular argument. The one structural concern, raised by the skeptic, is methodological rather than circular: LVD-142M is constructed by retrieving uncurated images that are visually close to a curated query set (Table 15) which substantially overlaps with the eval suite (ImageNet-1k/22k, Food-101, SUN397, Cars, Aircraft, VOC, DTD, Pets, Caltech101, Flowers, CUB, ADE20k, Cityscapes, VOC-Seg, NYU, KITTI, SUN-RGBD, R-Oxford, R-Paris, Met, AmsterTime, GLDv2). Combined with test-set deduplication (§3) but not train-distribution decontamination, this shapes pretraining mass toward the manifolds of benchmarks later used to compare against OpenCLIP. This is a confound on the OpenCLIP comparison, not a circularity in the formal sense — labels are never used, and the per-task heads are trained downstream. The paper itself partially defuses the concern with held-out evaluations on datasets that were not retrieval queries: iNaturalist-2018/2021 and Places205 (Table 7), video (K400/UCF/SSv2, Table 7), and robustness sets ImageNet-A/R/Sketch (Table 6), where DINOv2 still wins or matches. Table 2's "Uncurated data" row also shows curation alone is not the whole story. I score this 2: the retrieval pipeline introduces a mild measurement-asymmetry concern worth flagging, but no load-bearing claim reduces to its own input by construction or to an unverified self-citation. The "general-purpose features" headline would be strengthened by a fully retrieval-disjoint benchmark partition; the existing held-out evidence is non-trivial.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Model omitted the axiom ledger; defaulted for pipeline continuity.

pith-pipeline@v0.9.0 · 9782 in / 6222 out tokens · 101713 ms · 2026-05-09T01:50:51.976975+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

    cs.GR 2026-05 unverdicted novelty 8.0

    Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset ...

  2. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  3. neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

    cs.CV 2026-04 unverdicted novelty 8.0

    neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

  4. Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

    cs.CV 2026-04 unverdicted novelty 8.0

    The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

  5. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  6. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  7. Does Engram Do Memory Retrieval in Autoregressive Image Generation?

    cs.CV 2026-05 accept novelty 7.0

    Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

  8. SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude ...

  9. Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

    cs.LG 2026-05 unverdicted novelty 7.0

    Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibrat...

  10. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  11. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  12. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  13. PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 7.0

    PointGS achieves semantic-consistent unsupervised 3D point cloud segmentation by using 3D Gaussian Splatting to bridge discrete points and continuous 2D images for distilling SAM semantics.

  14. STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.

  15. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  16. Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

    cs.CV 2026-05 unverdicted novelty 7.0

    DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revea...

  17. Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

    cs.CV 2026-05 unverdicted novelty 7.0

    DRoRAE fuses multi-layer features from pretrained vision encoders to recover lost low-level details, reducing rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256.

  18. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

  19. When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

    cs.CV 2026-05 conditional novelty 7.0

    Raw CSD cosine similarity produces negative discrimination gaps for many artists and does not support absolute style-fidelity interpretation, but CSLS readout on frozen backbones reduces failures and improves AUC.

  20. PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preservin...

  21. MPD$^2$-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis

    cs.AI 2026-05 unverdicted novelty 7.0

    MPD²-Router is a dual-head deferral router that uses mask-aware Gumbel-sigmoid gating, asymmetric cost-sensitive training, and rank-majorization regularization to lower clinical cost and raise MCC versus AI-only basel...

  22. Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks

    eess.IV 2026-05 unverdicted novelty 7.0

    Self-supervised monocular depth estimation improves in low-texture regions by using distance transforms on jointly estimated pre-semantic contours to create more informative loss signals.

  23. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  24. Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...

  25. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  26. LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling

    cs.CV 2026-05 unverdicted novelty 7.0

    LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.

  27. From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 7.0

    Direct primitive comparison in Gaussian splatting with anisotropic drift models and observability terms enables multi-view consistent change detection that separates geometric and appearance changes, outperforming pri...

  28. From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 7.0

    GS-DIFF detects changes in 3D Gaussian Splatting scenes by direct primitive attribute comparison with anisotropic drift models and observability terms, outperforming render-then-compare baselines by ~17% mIoU.

  29. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  30. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  31. Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

    cs.AI 2026-05 unverdicted novelty 7.0

    ProCompNav disambiguates ambiguous instance navigation queries via candidate-pool construction followed by attribute-based comparative binary questions that prune distractors, yielding higher success rates and shorter...

  32. Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

    cs.AI 2026-05 unverdicted novelty 7.0

    ProCompNav improves success rate and shortens user responses in ambiguous instance navigation by using comparative binary questions that prune a candidate pool rather than requesting detailed descriptions.

  33. A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...

  34. A foundation model of vision, audition, and language for in-silico neuroscience

    q-bio.NC 2026-05 unverdicted novelty 7.0

    TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

  35. Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

    cs.CV 2026-05 unverdicted novelty 7.0

    HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.

  36. ReLeaf: Benchmarking Leaf Segmentation across Domains and Species

    cs.CV 2026-05 unverdicted novelty 7.0

    A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.

  37. Automated In-the-Wild Data Collection for Continual AI Generated Image Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    An automated fact-check-based pipeline for in-the-wild AI image data, when mixed with generator data in continual learning, lets detectors adapt to new generators while avoiding forgetting and delivers 8-9% accuracy g...

  38. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  39. Towards Visual Query Localization in the 3D World

    cs.CV 2026-05 unverdicted novelty 7.0

    The authors release the 3DVQL benchmark for 3D multimodal visual query localization and show that a lift-and-attention fusion module outperforms prior fusion baselines on it.

  40. Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

    cs.RO 2026-05 unverdicted novelty 7.0

    Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.

  41. Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

    cs.CV 2026-05 unverdicted novelty 7.0

    Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...

  42. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  43. Rethink MAE with Linear Time-Invariant Dynamics

    cs.CV 2026-04 unverdicted novelty 7.0

    Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

  44. AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

    cs.CV 2026-04 conditional novelty 7.0

    AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.

  45. Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

  46. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

  47. VitaminP: cross-modal learning enables whole-cell segmentation from routine histology

    cs.CV 2026-04 unverdicted novelty 7.0

    VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.

  48. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.

  49. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.

  50. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  51. WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

    cs.CV 2026-04 unverdicted novelty 7.0

    WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.

  52. Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

    cs.CV 2026-04 unverdicted novelty 7.0

    Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench f...

  53. Evaluating Remote Sensing Image Captions Beyond Metric Biases

    cs.CV 2026-04 unverdicted novelty 7.0

    Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...

  54. Frequency-Forcing: From Scaling-as-Time to Soft Frequency Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    Frequency-Forcing guides pixel flow-matching with a data-derived low-frequency auxiliary stream to softly enforce scale-ordered generation, improving FID on ImageNet-256 over baselines.

  55. TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    TransSplat uses unbalanced semantic transport to match edited 2D evidence with 3D Gaussians and recover a shared 3D edit field, yielding better local accuracy and structural consistency than prior view-consistency methods.

  56. Generative Texture Filtering

    cs.CV 2026-04 unverdicted novelty 7.0

    A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

  57. Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

    cs.CV 2026-04 unverdicted novelty 7.0

    Distillation from visual foundation models to lidar enables frame-wise indoor semantic segmentation without manual annotations, achieving up to 56% mIoU on pseudo labels and 36% on real labels.

  58. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  59. CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement

    cs.CV 2026-04 unverdicted novelty 7.0

    CFSR reframes shadow removal as a physics-constrained process using geometric and semantic priors from depth, DINO, CLIP, and frequency decoupling to achieve claimed state-of-the-art results.

  60. MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene

    cs.CV 2026-04 unverdicted novelty 7.0

    MU-GeNeRF combines source-view and target-view uncertainties via a heteroscedastic loss to enable distractor-aware generalizable NeRF reconstruction that matches scene-specific methods.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 323 Pith papers · 8 internal anchors
