pith. machine review for the scientific record.

arxiv: 2604.16758 · v1 · submitted 2026-04-18 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Frozen Vision Transformers for Dense Prediction on Small Datasets: A Case Study in Arrow Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords frozen vision transformers · dense prediction · small datasets · arrow localization · self-supervised learning · feature upsampling · archery target analysis · canonical rectification

The pith

A frozen self-supervised vision transformer localizes arrow punctures on archery targets to 1.41 mm accuracy using only 48 training images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how a pre-trained vision transformer can be frozen and paired with lightweight task-specific components to solve a precise measurement task from very few labeled examples. A color-based rectification step first converts each photo into a standard view so that pixels map directly to physical distances on the target. A DINOv3 ViT-L/16 model then supplies features that AnyUp upsampling turns back into sub-millimeter heatmaps, while only 3.8 million parameters are updated during training. The resulting system reaches an F1 score of 0.893 and a localization error of 1.41 mm, matching or exceeding earlier methods that needed far larger datasets. An ablation indicates the upsampling step already supplies the spatial detail that an extra offset regression head would normally provide.
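As a concrete illustration of the decoding step, the sketch below extracts puncture centers from a predicted heatmap by local-maximum suppression, the standard decoding rule for center-heatmap detectors with the offset head omitted. The paper does not publish its decoding code, so the 3×3 window and 0.5 threshold here are assumptions.

```python
# Hypothetical sketch of heatmap peak decoding without an offset head;
# window size and threshold are illustrative, not the paper's values.
import torch
import torch.nn.functional as F

def decode_peaks(heatmap: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """heatmap: (1, H, W) tensor in the rectified frame.
    Returns an (N, 2) tensor of (row, col) peak coordinates."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = (heatmap == pooled) & (heatmap > thresh)   # keep local maxima
    return peaks.nonzero()[:, 1:]                      # drop channel index
```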

Core claim

A pipeline that freezes a DINOv3 ViT-L/16 backbone, adds AnyUp guided upsampling to restore spatial resolution from 32 × 32 patch tokens, and attaches lightweight CenterNet-style heads can detect and localize arrow punctures on 40 cm targets after training on only 48 photographs containing 5,084 punctures. With 3.8 M trainable parameters out of 308 M total, the method yields a mean F1 of 0.893 ± 0.011 and mean localization error of 1.41 ± 0.06 mm across three cross-validation folds, while recovering average arrow scores to a median error of 1.8 percent and group centroids to a median of 4.00 mm.
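To make the architecture concrete, here is a minimal PyTorch sketch of that arrangement. The backbone and upsampler interfaces are stand-ins, not the published DINOv3 or AnyUp APIs: `get_intermediate_layers` follows the DINOv2-style feature accessor, the upsampler's call signature is assumed, and the head widths are illustrative rather than the paper's.

```python
# Minimal sketch of the frozen-backbone pipeline (assumed interfaces for
# DINOv3 and AnyUp; layer widths illustrative, not the paper's).
import torch
import torch.nn as nn

class ArrowDetector(nn.Module):
    def __init__(self, backbone: nn.Module, upsampler: nn.Module,
                 feat_dim: int = 1024):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():    # freeze the ViT-L/16 weights
            p.requires_grad = False
        self.upsampler = upsampler              # AnyUp-style guided upsampler
        self.heatmap_head = nn.Sequential(      # lightweight CenterNet-style head
            nn.Conv2d(feat_dim, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                   # backbone stays frozen
            tokens = self.backbone.get_intermediate_layers(
                image, reshape=True)[0]         # (B, C, 32, 32) patch features
        feats = self.upsampler(tokens, image)   # image-guided upsampling
        return torch.sigmoid(self.heatmap_head(feats))  # arrow-center heatmap
```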

What carries the argument

Frozen DINOv3 ViT-L/16 backbone combined with AnyUp guided feature upsampling to recover sub-millimeter precision from patch tokens, plus color-based canonical rectification that standardizes perspective views for physical measurement.

If this is right

  • Guided feature upsampling already supplies the spatial precision that an offset regression head normally provides, so the extra head can be omitted.
  • Only 3.8 million parameters need to be trained, leaving the remaining 304 million frozen (a parameter-count check is sketched after this list).
  • Per-image average arrow scores are recovered with a median error of 1.8 percent.
  • Group centroid positions are recovered to a median error of 4.00 mm.
  • The approach matches fully supervised methods that require substantially larger training sets.
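A quick check of the parameter split in the second bullet, assuming `model` is an instance of a pipeline like the `ArrowDetector` sketched above:

```python
# Count trainable vs. total parameters; the paper reports 3.8 M of 308 M.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f} M of {total / 1e6:.1f} M total")
```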

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-plus-light-adaptation pattern could be tested on other small-data dense prediction tasks such as defect detection on manufactured parts or tracking in sports video.
  • Performance may depend on having a reliable domain-specific rectification step; without it the model would need to learn perspective correction from the limited labels.
  • Alternative self-supervised backbones or different upsampling operators could be swapped in to test whether DINOv3 and AnyUp are uniquely effective here.
  • The low parameter count suggests the pipeline could run on modest hardware for real-time coaching feedback during archery sessions.

Load-bearing premise

The color-based canonical rectification stage reliably maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements.
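The premise can be made concrete with a standard homography-based rectification, sketched below with OpenCV. The paper's actual color-segmentation procedure is unspecified, so `src_pts` (detected ring landmarks in pixel coordinates) and their known physical positions `mm_pts` are assumed inputs, and the 0.5 mm-per-pixel canonical scale is illustrative.

```python
# Sketch of canonical rectification via a fitted homography (assumed
# formulation; the paper's color-segmentation step is not specified here).
import cv2
import numpy as np

MM_PER_PX = 0.5                      # assumed canonical scale
SIZE = int(400 / MM_PER_PX)          # 40 cm target face -> 800 px canvas

def rectify(photo: np.ndarray, src_pts: np.ndarray,
            mm_pts: np.ndarray) -> np.ndarray:
    """src_pts: (N, 2) pixel coords of detected ring landmarks;
    mm_pts: (N, 2) known physical positions on the target face, in mm."""
    dst_pts = mm_pts.astype(np.float32) / MM_PER_PX
    H, _ = cv2.findHomography(src_pts.astype(np.float32), dst_pts, cv2.RANSAC)
    return cv2.warpPerspective(photo, H, (SIZE, SIZE))
```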

What would settle it

Retraining and testing the same frozen backbone on a new collection of target photographs captured under changed lighting or with non-standard color patterns, then measuring whether localization error rises above 2 mm without the rectification step, would show whether the frozen model alone suffices.
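Since the paper cites the Hungarian method for matching (reference [10] below), the localization-error measurement in such an experiment would plausibly look like the following sketch; the 5 mm match cutoff is an assumption, not the paper's protocol.

```python
# Sketch of Hungarian-matching localization error in mm (assumed protocol;
# the match cutoff is illustrative, not taken from the paper).
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def localization_errors_mm(pred_mm: np.ndarray, gt_mm: np.ndarray,
                           max_dist: float = 5.0) -> np.ndarray:
    """pred_mm, gt_mm: (N, 2) and (M, 2) puncture centers in the
    rectified frame. Returns distances of accepted one-to-one matches."""
    cost = cdist(pred_mm, gt_mm)                 # pairwise distances, mm
    rows, cols = linear_sum_assignment(cost)     # optimal assignment
    dists = cost[rows, cols]
    return dists[dists <= max_dist]              # reject far matches
```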

Figures

Figures reproduced from arXiv: 2604.16758 by Maxwell Shepherd.

Figure 1. A 40 cm indoor archery target face with arrow punctures across the scoring rings.
Figure 2. The result of the canonical rectification process.
Figure 3. Model architecture. Blue blocks are frozen.
Figure 4. Training and validation loss (top) and val…
Figure 5. Ablation study comparing detection metrics (left) and mean localization error (right) with and without offset refinement.
Figure 6. Qualitative results on a validation image. Left: predicted heatmap overlay showing confident…
Figure 7. Predicted vs. ground-truth average arrow scores.
read the original abstract

We present a system for automated detection, localization, and scoring of arrow punctures on 40 cm indoor archery target faces, trained on only 48 annotated photographs (5,084 punctures). Our pipeline combines three components: a color-based canonical rectification stage that maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements; a frozen self-supervised vision transformer (DINOv3 ViT-L/16) paired with AnyUp guided feature upsampling to recover sub-millimeter spatial precision from 32 × 32 patch tokens; and lightweight CenterNet-style detection heads for arrow-center heatmap prediction. Only 3.8 M of 308 M total parameters are trainable. Across three cross-validation folds, we achieve a mean F1 score of 0.893 ± 0.011 and a mean localization error of 1.41 ± 0.06 mm, comparable to or better than prior fully-supervised approaches that require substantially more training data. An ablation study shows that the CenterNet offset regression head, typically essential for sub-pixel refinement, provides negligible detection improvement while degrading localization in our setting. This suggests that guided feature upsampling already resolves the spatial precision lost through patch tokenization. On downstream archery metrics, the system recovers per-image average arrow scores with a median error of 1.8% and group centroid positions to within a median of 4.00 mm. These results demonstrate that frozen foundation models with minimal task-specific adaptation offer a practical paradigm for dense prediction in small-data regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a pipeline for automated detection, localization, and scoring of arrow punctures on 40 cm archery targets using only 48 annotated images (5,084 punctures). It combines a color-based canonical rectification stage, a frozen DINOv3 ViT-L/16 backbone with AnyUp guided upsampling, and lightweight CenterNet-style heads, training just 3.8M of 308M parameters. Cross-validation yields mean F1 of 0.893 ± 0.011 and localization error of 1.41 ± 0.06 mm, with downstream median errors of 1.8% on scores and 4 mm on centroids; an ablation indicates the offset head adds little value.

Significance. If the rectification is reliable, the work would demonstrate that frozen foundation models with minimal adaptation can deliver competitive dense-prediction performance in small-data regimes, supported by cross-validation metrics with standard deviations and an ablation study. This offers a practical alternative to fully supervised training on large datasets for specialized applications.

major comments (1)
  1. [Abstract] The abstract (and the methods description of the color-based canonical rectification stage) reports all headline metrics in physical units (1.41 mm localization error, 4 mm centroid, 1.8% score error), yet no quantitative validation, accuracy metrics, failure-case analysis, or comparison against ground-truth physical measurements is provided for the rectification. If rectification error is comparable to or larger than 1.41 mm, the physical-unit claims lose empirical grounding and the small-data performance argument cannot be evaluated.
minor comments (2)
  1. The ablation study is summarized but lacks a table or explicit numerical comparison (F1 and localization error with vs. without the offset head) that would allow readers to assess the claim that guided upsampling already resolves spatial precision.
  2. Implementation details for AnyUp integration, the exact form of the CenterNet heads, and the rectification algorithm (e.g., color calibration procedure) are referenced but not fully specified, limiting reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback. The primary concern is the lack of quantitative validation for the color-based canonical rectification stage underlying the physical-unit metrics. We address this point below and commit to revisions that strengthen the empirical support without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The abstract (and the methods description of the color-based canonical rectification stage) reports all headline metrics in physical units (1.41 mm localization error, 4 mm centroid, 1.8% score error), yet no quantitative validation, accuracy metrics, failure-case analysis, or comparison against ground-truth physical measurements is provided for the rectification. If rectification error is comparable to or larger than 1.41 mm, the physical-unit claims lose empirical grounding and the small-data performance argument cannot be evaluated.

    Authors: We agree that explicit quantitative validation of the rectification would strengthen the grounding of the physical-unit claims. The rectification applies color-based segmentation to detect the target's outer boundary and rings, followed by a homography that maps each image to a canonical coordinate frame in which the 40 cm target diameter is fixed, enabling direct pixel-to-mm conversion. All reported localization and scoring errors are computed in this rectified frame against human annotations. The low cross-validation standard deviation on localization error (0.06 mm) provides indirect evidence of rectification consistency, as inconsistent rectification would inflate variance. Nevertheless, we acknowledge the referee's point is valid and will revise the manuscript to add: (i) a precise algorithmic description of the color segmentation and homography steps, (ii) quantitative metrics such as the mean and standard deviation of the estimated target diameter (in pixels) across the 48 images and the average corner reprojection error of the fitted homography, and (iii) a short failure-case analysis (e.g., images with extreme lighting or partial occlusion) together with their frequency and impact on downstream metrics. These additions will be placed in a new subsection of Methods and will demonstrate that rectification error is well below the 1.41 mm localization figure, thereby preserving the validity of the physical-unit results. revision: yes
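The corner reprojection error the authors propose in (ii) is a standard quantity; a minimal sketch of computing it, assuming `H` is the fitted 3×3 homography and the point arrays follow the layout used in the rectification sketch above:

```python
# Sketch of mean reprojection error for a fitted homography H
# (standard formulation; data layout assumed).
import numpy as np

def mean_reprojection_error(H: np.ndarray, src_pts: np.ndarray,
                            dst_pts: np.ndarray) -> float:
    src_h = np.hstack([src_pts, np.ones((len(src_pts), 1))])  # homogeneous
    proj = (H @ src_h.T).T
    proj = proj[:, :2] / proj[:, 2:3]                         # dehomogenize
    return float(np.linalg.norm(proj - dst_pts, axis=1).mean())
```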

Circularity Check

0 steps flagged

No circularity in empirical pipeline

full rationale

The paper presents an empirical system for arrow localization using a frozen DINOv3 ViT with AnyUp upsampling and lightweight detection heads, trained on 48 images and evaluated via cross-validation folds against external annotations. No mathematical derivations, equations, or predictions are claimed that reduce to fitted parameters or inputs by construction. The color-based rectification is described as a preprocessing stage but is not derived from or equivalent to the model outputs; all reported metrics (F1, mm errors) are measured directly on held-out data. The empirical chain therefore contains no self-referential reductions and is grounded in external benchmarks throughout.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new free parameters, axioms, or invented entities are introduced; the work relies on standard pre-trained models, existing upsampling techniques, and conventional detection heads without additional postulates.

pith-pipeline@v0.9.0 · 5589 in / 1036 out tokens · 40621 ms · 2026-05-10T08:01:49.918323+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

10 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Objects as points,

    X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019

  2. [2]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab et al., “DINOv2: Learning robust visual features without supervision,” Trans. Mach. Learn. Res., 2024

  3. [3]

    DINOv3

    O. Siméoni et al., “DINOv3,” arXiv preprint arXiv:2508.10104, 2025

  4. [4]

    AnyUp: Universal feature upsampling,

    T. Wimmer, P. Truong, M.-J. Rakotosaona, M. Oechsle, F. Tombari, B. Schiele, and J. E. Lenssen, “AnyUp: Universal feature upsampling,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2026

  5. [5]

    Archery score analysis system for outdoor environments,

    S. Kim, J. Moon, and E. C. Lee, “Archery score analysis system for outdoor environments,” Proc. Inst. Mech. Eng., Part P: J. Sports Eng. Technol., 2025

  6. [6]

    Rulebook, Book 3: Target Archery,

    World Archery, “Rulebook, Book 3: Target Archery,” World Archery Federation, 2025. [Online]. Available: https://www.worldarchery.sport/rulebook/article/13

  7. [7]

    Optuna: A next-generation hyperparameter optimization framework,

    T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, 2019, pp. 2623–2631

  8. [8]

    Albumentations: Fast and flexible image augmentations,

    A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin, “Albumentations: Fast and flexible image augmentations,” Information, vol. 11, no. 2, p. 125, 2020

  9. [9]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019

  10. [10]

    The Hungarian method for the assignment problem,

    H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955