pith. machine review for the scientific record.

arxiv: 2604.03428 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords circuit duplication · frozen vision transformers · marine species classification · inference optimization · label-efficient learning · DINOv3 embeddings · AQUA20 benchmark · underwater image classification

The pith

Duplicating selected transformer layers at inference time improves frozen embeddings for marine species classification without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether circuit duplication—traversing a chosen range of layers twice in a frozen visual transformer—can lift performance on underwater images when labels are scarce. Using DINOv3 embeddings on the class-imbalanced AQUA20 benchmark, both global and class-specific selection of duplicated circuits beat the standard single-pass frozen baseline with simple semi-supervised classifiers. At the largest label budget, class-specific selection reaches 0.875 macro F1, shrinking the gap to a fully supervised ConvNeXt model (0.889) to only 1.4 points. Four species, including octopus, surpass their supervised references. About 75 percent of classes favor their own circuit, pointing to genuinely class-dependent gains and marking the first use of the technique in computer vision.

Core claim

Circuit duplication lets a frozen visual transformer traverse a selected range of its layers twice during the forward pass, yielding an optimized inference path that consistently outperforms the standard frozen embedding on the AQUA20 marine species dataset. Class-specific selection reaches a macro F1 of 0.875 at the maximum label budget, closing the gap to a fully supervised ConvNeXt benchmark (0.889) to 1.4 points and exceeding it for some classes, all without gradient updates or weight changes.

What carries the argument

Circuit duplication: selecting a range of transformer layers and traversing them twice during the forward pass to optimize the inference path.
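The mechanism can be sketched in a few lines. Below, each "layer" is a toy residual linear map standing in for a frozen transformer block; the dimensions, the layer stand-in, and the (4, 7) range are illustrative assumptions rather than the paper's configuration. The only point is how a chosen range of layers is traversed twice in one forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a frozen transformer's layers: each "layer" here is a fixed
# random linear map with a residual connection (hypothetical, for illustration).
DIM, N_LAYERS = 16, 12
weights = [rng.normal(scale=0.05, size=(DIM, DIM)) for _ in range(N_LAYERS)]

def layer(x, i):
    """One frozen 'transformer layer' (residual + linear stand-in)."""
    return x + x @ weights[i]

def forward(x, duplicate_range=None):
    """Forward pass; optionally traverse layers a..b (inclusive) twice.

    duplicate_range=(a, b) replays that block a second time immediately
    after its first traversal -- the circuit-duplication idea.
    """
    path = list(range(N_LAYERS))
    if duplicate_range is not None:
        a, b = duplicate_range
        # Insert the duplicated block right after its first traversal.
        path = path[:b + 1] + list(range(a, b + 1)) + path[b + 1:]
    for i in path:
        x = layer(x, i)
    return x

x = rng.normal(size=DIM)
baseline = forward(x)                            # standard single pass
duplicated = forward(x, duplicate_range=(4, 7))  # layers 4..7 run twice
```

No weights change between the two calls; only the traversal order differs, which is why the method needs no gradient updates.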

If this is right

  • Circuit duplication improves over the standard frozen forward pass across all label budgets.
  • Class-specific selection reaches 0.875 macro F1 at maximum budget, within 1.4 points of the fully supervised ConvNeXt reference.
  • Four species exceed their supervised performance, with octopus gaining +12.1 F1 points.
  • Roughly 75 percent of classes benefit from a tailored circuit, showing class-dependent value.
  • The method requires no gradient-based training, preserving the frozen foundation model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Inference-time path changes like this could let foundation models adapt to other label-scarce scientific imaging domains without retraining.
  • Automating circuit choice from image statistics might remove the need for per-class search at deployment.
  • Combining circuit duplication with other semi-supervised signals could further shrink the remaining gap to fully supervised results.
  • Testing the same duplication ranges on terrestrial or medical vision tasks would reveal whether the gains are marine-specific or more general.

Load-bearing premise

The performance lift comes from duplicating the chosen layers rather than from the circuit-selection procedure overfitting to AQUA20 statistics or from differences in downstream classifier training.

What would settle it

Reproducing the experiments on an independent marine image dataset and observing no improvement or a performance drop from circuit duplication would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.03428 by Thomas Manuel Rost.

Figure 1. Example of the effective path when layers …
Figure 2. Pipeline for global circuit selection. All 66 duplicated circuits are swept over the frozen …
Figure 3. Pipeline for class-specific circuit selection. The same sweep is performed, but the best …
Figure 4. Global macro F1 across label budgets. Red: standard frozen baseline. Blue: globally …
Figure 5. Per-class F1 scores at the 100% label budget. Red: best standard frozen baseline classifier …
Figure 6. Composition of per-class winning strategies across label budgets. Red: number of classes where the standard forward pass (baseline) achieves the best F1. Blue: classes where the globally optimized circuit is best. Purple: classes where a unique circuit, distinct from both baseline and global winner, achieves the best F1. Across all budgets, approximately 75% of classes prefer a class-specific circuit.
Figure 7. Global accuracy across label budgets. Conventions as in Figure 4. The ConvNeXt fully …
Figure 8. Per-class F1 scores at 5% label budget.
Figure 9. Per-class F1 scores at 10% label budget.
Figure 10. Per-class F1 scores at 15% label budget.
Figure 11. Per-class F1 scores at 5 seeds per class.
Figure 12. Per-class F1 scores at 10 seeds per class.
Figure 13. Per-class F1 scores at 20 seeds per class.
The original abstract

Automated underwater species classification is constrained by annotation cost and environmental variation that limits the transferability of fully supervised models. Recent work has shown that frozen embeddings from self-supervised vision foundation models already provide a strong label-efficient baseline for marine image classification. Here we investigate whether this frozen-embedding regime can be improved at inference time, without fine-tuning or changing model weights. We apply Circuit Duplication, an inference-time method originally proposed for Large Language Models, in which a selected range of transformer layers is traversed twice during the forward pass. We evaluate on the class-imbalanced AQUA20 benchmark using frozen DINOv3 embeddings under two settings: global circuit selection, where a single duplicated circuit is chosen for the full dataset, and class-specific circuit selection, where each species may receive a different optimal circuit. Both settings use simple semi-supervised downstream classifiers. Circuit Duplication consistently improves over the standard frozen forward pass. At the maximum label budget, class-specific selection reaches a macro F1 of 0.875, closing the gap to the fully supervised ConvNeXt benchmark (0.889) to 1.4 points without any gradient-based training. Four species exceed their fully supervised reference, with octopus improving by +12.1 F1 points. Across all budgets, roughly 75% of classes prefer a class-specific circuit, indicating a genuinely class-dependent benefit. To our knowledge, this is the first application of Circuit Duplication to computer vision.
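The two selection regimes described in the abstract can be sketched as follows. If a circuit is a layer pair (a, b) over a 12-layer backbone, there are C(12, 2) = 66 duplicated circuits, which is consistent with the 66-circuit sweep in Figure 2; that backbone depth is an inference, not a stated fact. The `score` function below is a toy deterministic stand-in for the real procedure of training the semi-supervised classifier on a circuit's embeddings and reading off its macro F1:

```python
from itertools import combinations

N_LAYERS = 12
# None = standard single-pass baseline; (a, b) = duplicate layers a..b.
CIRCUITS = [None] + [(a, b) for a, b in combinations(range(N_LAYERS), 2)]

def score(circuit, cls=None):
    """Toy surrogate for 'train classifier on this circuit, return macro F1'."""
    if circuit is None:
        return 0.80                      # baseline frozen forward pass
    a, b = circuit
    bias = 0.0 if cls is None else 0.01 * (cls % 3)
    return 0.80 + 0.001 * (b - a) + bias  # illustrative numbers, not real F1

# Global selection: one circuit for the whole dataset.
global_best = max(CIRCUITS, key=score)

# Class-specific selection: each of AQUA20's 20 species may pick its own circuit.
per_class_best = {cls: max(CIRCUITS, key=lambda c: score(c, cls))
                  for cls in range(20)}
```

Class-specific selection differs from global selection only in running the same sweep once per class instead of once overall.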

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Circuit Duplication—an inference-time technique that traverses a selected range of transformer layers twice—as a way to improve frozen DINOv3 embeddings for class-imbalanced marine species classification on the AQUA20 benchmark. It evaluates both global and class-specific circuit selection using simple semi-supervised downstream classifiers, reporting that class-specific selection reaches macro F1 of 0.875 at the maximum label budget (closing the gap to a fully supervised ConvNeXt baseline of 0.889 to 1.4 points) without any gradient-based training, with four species exceeding the supervised reference and roughly 75% of classes preferring class-specific circuits.

Significance. If the reported gains can be shown to arise from duplication rather than selection leakage, the work would be a meaningful first demonstration of inference-path optimization for vision transformers in label-scarce domains. It highlights the potential for class-dependent layer traversal to approach supervised performance at low annotation cost and could influence efficient deployment of foundation models in environmental monitoring.

major comments (3)
  1. [Methods] The methods description provides no explicit statement that circuit selection (global or class-specific) is performed on data strictly disjoint from the AQUA20 evaluation splits. Class-specific selection, where each species independently chooses its duplicated range, therefore risks fitting to test-set statistics; the headline result (0.875 macro F1) cannot be verified as arising from duplication itself rather than per-class optimization on the reported benchmark.
  2. [Results] No error bars, standard deviations across runs, or statistical significance tests are supplied for any F1 numbers in the abstract or results. Given the class imbalance and the small absolute gap to the supervised baseline (1.4 points), the claim of “consistent improvement” across label budgets cannot be assessed for reliability.
  3. [Results] The statement that “roughly 75% of classes prefer a class-specific circuit” is itself computed after selection on the same data used for final reporting; an ablation that repeats selection on a held-out validation partition and then evaluates on test data is required to support the central claim that the benefit is genuinely class-dependent rather than an artifact of the selection procedure.
minor comments (1)
  1. [Abstract] The abstract refers to “simple semi-supervised downstream classifiers” without specifying their architecture, loss, or hyper-parameters; these details are needed for reproducibility even if the focus is on the frozen backbone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and have made revisions to the manuscript where necessary.

Point-by-point responses
  1. Referee: [Methods] The methods description provides no explicit statement that circuit selection (global or class-specific) is performed on data strictly disjoint from the AQUA20 evaluation splits. Class-specific selection, where each species independently chooses its duplicated range, therefore risks fitting to test-set statistics; the headline result (0.875 macro F1) cannot be verified as arising from duplication itself rather than per-class optimization on the reported benchmark.

    Authors: We thank the referee for pointing this out. The original manuscript indeed lacked an explicit statement on this matter. In our experimental setup, circuit selection was performed exclusively on a held-out validation set that is strictly disjoint from the test splits used for final evaluation. We have revised the Methods section to include a detailed description of the data partitioning and selection procedure, confirming no test-set leakage. This ensures the reported improvements are attributable to the duplication technique. revision: yes
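The leakage-free protocol the simulated authors describe might look like the following sketch; the splits and scores are hypothetical, and the only point is that selection reads validation scores while the reported number comes from the disjoint test split:

```python
# Hypothetical per-split macro F1 tables, keyed by circuit
# (None = standard single-pass baseline).
val_f1  = {None: 0.80, (2, 5): 0.84, (4, 7): 0.86, (8, 11): 0.82}
test_f1 = {None: 0.79, (2, 5): 0.83, (4, 7): 0.85, (8, 11): 0.81}

# Selection touches only the validation scores...
chosen = max(val_f1, key=val_f1.get)

# ...and the reported number comes only from the test split.
reported = test_f1[chosen]
```

Under this protocol the chosen circuit can look worse on test than on validation, which is exactly the honesty the referee is asking for.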

  2. Referee: [Results] No error bars, standard deviations across runs, or statistical significance tests are supplied for any F1 numbers in the abstract or results. Given the class imbalance and the small absolute gap to the supervised baseline (1.4 points), the claim of “consistent improvement” across label budgets cannot be assessed for reliability.

    Authors: We agree that the absence of error bars and statistical analysis limits the assessment of reliability. We have rerun the experiments with multiple random seeds and now report mean macro F1 scores with standard deviations in the revised Results section and tables. Additionally, we have included paired t-tests to assess statistical significance of the improvements over the baseline frozen embeddings. These additions confirm that the gains are consistent and statistically significant across label budgets. revision: yes
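The analysis the rebuttal promises might be sketched as below, computing the paired t statistic by hand over per-seed scores; the numbers are hypothetical, and a real analysis would also look up the p-value for n − 1 degrees of freedom:

```python
import math

def paired_t(xs, ys):
    """Paired t statistic for per-seed scores of two methods.

    xs, ys: macro F1 per random seed for methods A and B (same seeds).
    Returns the t statistic; compare against a t table with n-1 dof.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-seed macro F1: duplicated circuit vs frozen baseline.
dup      = [0.872, 0.875, 0.878, 0.871, 0.876]
baseline = [0.858, 0.861, 0.860, 0.857, 0.862]
t = paired_t(dup, baseline)  # large positive t => consistent improvement
```

Pairing by seed matters here: the variance of the per-seed differences, not of the raw scores, is what decides whether a 1.4-point gap is reliable.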

  3. Referee: [Results] The statement that “roughly 75% of classes prefer a class-specific circuit” is itself computed after selection on the same data used for final reporting; an ablation that repeats selection on a held-out validation partition and then evaluates on test data is required to support the central claim that the benefit is genuinely class-dependent rather than an artifact of the selection procedure.

    Authors: This is a valid concern. The original 75% figure was indeed derived from selection on the full dataset. We have conducted the requested ablation: circuit selection is now performed on a separate validation partition, followed by evaluation on the held-out test set. In this setup, 72% of classes still prefer class-specific circuits, with similar performance gains. We have added this ablation study to the Results section, along with updated figures, to substantiate the class-dependent nature of the benefit. revision: yes

Circularity Check

1 step flagged

Class-specific circuit selection fits to AQUA20, making reported gains partly by construction

specific steps
  1. fitted input called prediction [Abstract (and implied Results)]
    "At the maximum label budget, class-specific selection reaches a macro F1 of 0.875, closing the gap to the fully supervised ConvNeXt benchmark (0.889) to 1.4 points without any gradient-based training. ... Across all budgets, roughly 75% of classes prefer a class-specific circuit"

    The 'optimal' circuit per species is chosen to maximize performance on the AQUA20 benchmark splits; the reported F1 is therefore the fitted value after selection rather than the result of applying a fixed duplication rule to unseen data. The improvement is statistically forced by the selection procedure itself.

full rationale

The paper's headline result (class-specific macro F1 0.875) rests on selecting a per-class duplicated layer range that maximizes a metric on the same AQUA20 splits used for final reporting. This matches the fitted-input-called-prediction pattern: the choice of circuit is optimized on the evaluation data, so the lift is not a fixed inference-path property but the outcome of data-dependent cherry-picking. Global selection improves less, consistent with reduced overfitting opportunity. The claim that 75% of classes prefer class-specific circuits is itself a post-selection statistic. No quoted statement confirms selection uses a strictly held-out validation set disjoint from the reported test splits. The duplication mechanism itself is not shown to be load-bearing once selection is removed.
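The concern can be made concrete with a toy null experiment: give all 66 circuits identical true performance, add evaluation noise, and pick the maximum on the same data used for reporting. All numbers below are hypothetical:

```python
import random

random.seed(0)

TRUE_F1 = 0.80  # every circuit's true performance under the null hypothesis
NOISE = 0.02    # evaluation noise on the split used for selection

# Score all 66 circuits on the selection data and keep the best.
observed = [TRUE_F1 + random.gauss(0, NOISE) for _ in range(66)]
selected_f1 = max(observed)  # inflated above TRUE_F1 by construction

# A fresh draw for the winner approximates its held-out performance.
held_out_f1 = TRUE_F1 + random.gauss(0, NOISE)
```

The max of 66 noisy estimates sits well above the true value even though no circuit actually helps, which is why a gain reported on the selection split is not evidence of a duplication effect on its own.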

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the empirical transfer of an LLM inference trick to vision and on the assumption that frozen DINOv3 embeddings already encode useful marine features; no new mathematical objects are introduced.

free parameters (1)
  • circuit range and selection rule
    Which transformer layers are duplicated and whether the choice is global or per-class is determined by performance on the target benchmark.
axioms (1)
  • domain assumption: Frozen self-supervised vision embeddings already provide a strong label-efficient baseline for marine image classification
    Invoked in the opening paragraph as established by recent work.

pith-pipeline@v0.9.0 · 5560 in / 1279 out tokens · 48608 ms · 2026-05-13T19:52:38.524721+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches — The paper's claim is directly supported by a theorem in the formal canon.
  • supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses — The paper appears to rely on the theorem as machinery.
  • contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 3 internal anchors

  1. [1] Taufikur Rahman Fuad, Sabbir Ahmed, and Shahriar Ivan. AQUA20: A benchmark dataset for underwater species classification under challenging conditions. Arabian Journal for Science and Engineering, 2026.
  2. [2] Alzayat Saleh, Issam H. Laradji, Dmitry A. Konovalov, Michael Bradley, David Vazquez, and Marcus Sheaves. Computer vision and deep learning for fish classification in underwater habitats: A survey. Fish and Fisheries, 23:977–999, 2022.
  3. [3] Marko Radeta, Agustin Zuniga, Naser Hossein Motlagh, Mohan Liyanage, Ruben Freitas, Moustafa Youssef, Sasu Tarkoma, Huber Flores, and Petteri Nurmi. Deep learning and the oceans. Computer, 55(5):39–50, 2022.
  4. [4] Maxime Oquab, Timothée Darcet, Théo Moutakanni, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  5. [5] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
  6. [6] Thomas Manuel Rost. Label-efficient underwater species classification with semi-supervised learning on frozen foundation model embeddings, 2026.
  7. [7] David Noel Ng. LLM neuroanatomy: How I topped the LLM leaderboard without changing a single weight. Blog post, 2026.
  8. [8] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, 2002.
  9. [9] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
  10. [10] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In ACL, 1995.
  11. [11] Robert B. Fisher, Yun-Heh Chen-Burger, Daniela Giordano, Lynda Hardman, and Fang-Pang Lin. Fish4Knowledge: Collecting and Analyzing Massive Coral Reef Fish Video Data. Springer, 2016.
  12. [12] Morten Goodwin, Kim Tallaksen Halvorsen, Lei Jiao, et al. Unlocking the potential of deep learning for marine ecology: Overview, applications, and outlook. ICES Journal of Marine Science, 79(2):319–336, 2022.
  13. [13] Carlos Dominguez-Carrió, Joan Lluís Riera, Katleen Robert, Mikel Zabala, Susana Requena, Josep-Maria Gili, Jordi Grinyó, Covadonga Orejas, Claudio Lo Iacono, Enrique Isla, Alejandra Londoño-Burbano, and Telmo Morato. A cost-effective video system for a rapid appraisal of deep-sea benthic habitats: The Azor drift-cam. Methods in Ecology and Evolution, 12:1379–1388, 2021.
  14. [14] Sparsh Mittal, Srishti Srivastava, and J. Phani Jayanth. A survey of deep learning techniques for underwater image classification. IEEE Transactions on Neural Networks and Learning Systems, 34(10):6968–6982, 2023.
  15. [15] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  16. [16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
  17. [17] Hugo Markoff, Stefan Hein Bengtson, and Michael Ørsted. Vision transformers for zero-shot clustering of animal images: A comparative benchmarking study. arXiv preprint arXiv:2602.03894, 2026.
  18. [18] Murilo Gustineli et al. Multi-label plant species classification with self-supervised vision transformers. In CLEF 2024 Working Notes, 2024.
  19. [19] Artzai Picon et al. Robust multi-species agricultural segmentation across devices, seasons, and sensors using hierarchical DINOv2 models. arXiv preprint arXiv:2508.07514, 2026.
  20. [20] Sangmin Ying et al. Relaxed recursive transformers: Effective parameter sharing with layer-wise LoRA. arXiv preprint arXiv:2410.20672, 2024.