Mitigating Domain Drift in Multi Species Segmentation with DINOv2: A Cross-Domain Evaluation in Herbicide Research Trials

Artzai Picon; Carlos Javier Jimenez; Christian Klukas; Daniel Mugica; Eric White; Gabriel Do-Lago-Junqueira; Itziar Eguskiza; Javier Romero; Ramon Navarra-Mestre

arxiv: 2508.07514 · v4 · submitted 2025-08-11 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-19 00:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords semantic segmentation · domain shift · foundation models · plant phenotyping · herbicide trials · DINOv2 · hierarchical classification · agricultural imaging

The pith

DINOv2 backbone with hierarchical taxonomy lifts species segmentation F1 from 0.52 to 0.87 and holds gains under geographic and drone shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a segmentation system built on the DINOv2 vision foundation model plus hierarchical taxonomic inference can keep working when plant imagery changes across years, countries, cameras, and platforms. Models were trained on a multi-year collection of German and Spanish herbicide trials covering 14 species and four damage types, then evaluated on 2023 updates, transfer to US sites, and 2024 drone flights. The foundation-model approach produced large accuracy gains over earlier baselines at every level of shift, while the hierarchy supplied usable coarser predictions when species labels became unreliable. The result matters for herbicide phenotyping because operational pipelines need models that do not require fresh labels or retraining for each new region or sensor.

Core claim

When trained on the 2018-2020 German-Spanish dataset, the DINOv2 backbone reaches species-level F1 of 0.87 on held-out in-distribution images, 0.77 under moderate temporal and device shifts, and 0.44 under extreme sensor shift to 2024 drone imagery, compared with baseline scores of 0.52, 0.24 and 0.14 respectively. Hierarchical inference supplies additional robustness, delivering family F1 of 0.68 and class F1 of 0.88 on the aerial imagery where fine-grained species classification drops. Error patterns indicate that severe-shift mistakes arise mainly from vegetation-soil confusion rather than collapse of taxonomic distinctions.

What carries the argument

DINOv2 vision foundation model used as segmentation backbone, paired with hierarchical inference that maps predictions across species, family, and class levels.
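
One way to picture the hierarchical step is as a roll-up of per-pixel species probabilities into family and class probabilities, with a back-off to the coarser level when the species score is weak. The sketch below is an illustration under stated assumptions, not the authors' implementation: the species names, the taxonomy table, and the `min_conf` threshold are hypothetical, and the paper may instead train a dedicated output per taxonomic level.

```python
# Hypothetical taxonomy: species -> (family, class). Names are illustrative
# only and are not taken from the paper's 14-species label set.
TAXONOMY = {
    "zea_mays": ("Poaceae", "monocot"),
    "setaria_viridis": ("Poaceae", "monocot"),
    "glycine_max": ("Fabaceae", "dicot"),
    "amaranthus_retroflexus": ("Amaranthaceae", "dicot"),
}
LEVELS = ("species", "family", "class")


def roll_up(species_probs):
    """Sum species probabilities into family- and class-level probabilities."""
    out = {"species": dict(species_probs), "family": {}, "class": {}}
    for species, p in species_probs.items():
        family, cls = TAXONOMY[species]
        out["family"][family] = out["family"].get(family, 0.0) + p
        out["class"][cls] = out["class"].get(cls, 0.0) + p
    return out


def predict_with_backoff(species_probs, min_conf=0.6):
    """Return (level, label) for the finest level whose top score reaches min_conf.

    min_conf is an assumed threshold. If no level is confident enough, the
    class-level argmax is returned as the coarsest available answer.
    """
    probs = roll_up(species_probs)
    for level in LEVELS:
        label, score = max(probs[level].items(), key=lambda kv: kv[1])
        if score >= min_conf:
            return level, label
    return "class", label  # label here is the class-level argmax


# Toy probabilities for one pixel (illustrative, not data from the paper).
pixel = {
    "zea_mays": 0.40,
    "setaria_viridis": 0.35,
    "glycine_max": 0.15,
    "amaranthus_retroflexus": 0.10,
}
level, label = predict_with_backoff(pixel)
```

In this toy case no single species reaches the threshold, but the two grass species together do, so the pixel is reported at family level. That is the kind of behaviour the hierarchy is credited with when species labels become unreliable.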

If this is right

The same backbone can be deployed directly in multi-region herbicide phenotyping workflows without per-site retraining.
Coarser taxonomic outputs remain informative for trial analysis even when species labels become unreliable under shift.
Primary failures under extreme shift are background confusions rather than loss of plant-type distinctions.
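
The distinction in the last point can be checked from a pixel-level confusion matrix by splitting the off-diagonal mass into plant-versus-background confusions and plant-versus-plant confusions. The following is a minimal sketch of that bookkeeping. The label names and counts are toy values chosen for illustration, not results from the paper, and the function name `split_errors` is hypothetical.

```python
def split_errors(confusion, labels, background="soil"):
    """Split misclassified pixels into background and taxonomic confusions.

    confusion[i][j] counts pixels with true label i predicted as label j.
    Returns each error type as a fraction of all errors.
    """
    bg = labels.index(background)
    background_err = 0
    taxonomic_err = 0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if i == j:
                continue  # correct pixels are not errors
            if i == bg or j == bg:
                background_err += count  # plant <-> soil confusion
            else:
                taxonomic_err += count  # one plant label mistaken for another
    total = background_err + taxonomic_err
    if total == 0:
        return {"background": 0.0, "taxonomic": 0.0}
    return {
        "background": background_err / total,
        "taxonomic": taxonomic_err / total,
    }


# Toy counts for illustration only; these are not values from the paper.
labels = ["soil", "species_a", "species_b"]
confusion = [
    [900, 30, 20],  # true soil
    [60, 300, 10],  # true species_a
    [40, 5, 200],   # true species_b
]
shares = split_errors(confusion, labels)
```

A background share close to 1 would match the paper's reading that severe-shift failures are mostly vegetation-soil confusion. A large taxonomic share would point to collapsing species distinctions instead.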

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The preserved taxonomic structure under large viewpoint and sensor changes suggests that DINOv2 features already encode morphological regularities that match biological hierarchies.
The minimal-adaptation result could extend to other variable agricultural or ecological imaging tasks that currently require repeated labeling campaigns.
Adding a small number of target-domain examples at the family level might further stabilize the extreme-shift regime without retraining the entire model.

Load-bearing premise

The multi-year German-Spanish training set together with a standard biological taxonomy already contains the main variations that will appear in US locations and 2024 drone imagery, so no domain-specific adaptation or extra labeled data at higher taxonomic levels is required.

What would settle it

Labeled drone images collected from a new US site in 2025 that yield species F1 below 0.30 while family F1 remains above 0.65 would support the hierarchy claim; uniform collapse of all taxonomic levels below 0.25 would falsify the robustness benefit.
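
The test above can be written down as an explicit decision rule. This sketch encodes only the thresholds stated in the paragraph. The function name and the "inconclusive" outcome for results that meet neither condition are assumptions added for illustration.

```python
def assess_hierarchy_claim(species_f1, family_f1, class_f1):
    """Classify a new evaluation against the thresholds stated above."""
    # Uniform collapse: every taxonomic level falls below 0.25.
    if max(species_f1, family_f1, class_f1) < 0.25:
        return "falsifies robustness benefit"
    # Species degrades while the family level stays usable.
    if species_f1 < 0.30 and family_f1 > 0.65:
        return "supports hierarchy claim"
    # Assumed catch-all; the text does not say how to read other outcomes.
    return "inconclusive"
```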

Original abstract

Reliable plant species and damage segmentation for herbicide field research trials requires models that can withstand substantial real-world variation across seasons, geographies, devices, and sensing modalities. Most deep learning approaches trained on controlled datasets fail to generalize under these domain shifts, limiting their suitability for operational phenotyping pipelines. This study evaluates a segmentation framework that integrates vision foundation models (DINOv2) with hierarchical taxonomic inference to improve robustness across heterogeneous agricultural conditions. We train on a large, multi-year dataset collected in Germany and Spain (2018-2020), comprising 14 plant species and 4 herbicide damage classes, and assess generalization under increasingly challenging shifts: temporal and device changes (2023), geographic transfer to the United States, and extreme sensor shift to drone imagery (2024). Results show that the foundation-model backbone consistently outperforms prior baselines, improving species-level F1 from 0.52 to 0.87 on in-distribution data and maintaining significant advantages under moderate (0.77 vs. 0.24) and extreme (0.44 vs. 0.14) shift conditions. Hierarchical inference provides an additional layer of robustness, enabling meaningful predictions even when fine-grained species classification degrades (family F1: 0.68, class F1: 0.88 on aerial imagery). Error analysis reveals that failures under severe shift stem primarily from vegetation-soil confusion, suggesting that taxonomic distinctions remain preserved despite background and viewpoint variability. The system is now deployed within BASF's phenotyping workflow for herbicide research trials across multiple regions, illustrating the practical viability of combining foundation models with structured biological hierarchies for scalable, shift-resilient agricultural monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows usable F1 gains for 14-species plant segmentation under geographic and sensor shifts by combining DINOv2 with hierarchical inference, but lacks ablations that separate the backbone from the taxonomy step.

The main thing to know is that this work reports concrete improvements in multi-species segmentation for herbicide trials when moving from multi-year German and Spanish training data to US sites and 2024 drone imagery. Species-level F1 rises from 0.52 to 0.87 in-distribution and holds an edge under moderate shift (0.77 vs 0.24) and extreme shift (0.44 vs 0.14), with the hierarchy keeping family and class predictions usable when fine-grained labels degrade. The system is already running inside BASF's phenotyping pipeline, which gives the results some operational weight.

Referee Report

1 major / 2 minor

Summary. The paper evaluates a segmentation framework that combines DINOv2 vision foundation models with hierarchical taxonomic inference for plant species and herbicide damage segmentation. Trained on a multi-year German-Spanish dataset of 14 species and 4 damage classes, the approach is tested on progressively harder held-out shifts (temporal/device changes in 2023, geographic transfer to the US, and extreme sensor shift to drone imagery in 2024). It reports consistent outperformance over prior baselines, with species-level F1 rising from 0.52 to 0.87 in-distribution and retaining advantages under moderate (0.77 vs. 0.24) and extreme (0.44 vs. 0.14) shifts; hierarchical inference is credited with additional robustness at family and class levels, and the system is noted as deployed in BASF phenotyping workflows.

Significance. If the performance gains are attributable to the DINOv2 backbone rather than the hierarchical post-processing alone, the work provides concrete evidence that foundation-model features plus structured biological taxonomies can deliver shift-resilient segmentation for operational agricultural phenotyping without requiring new labeled data for each domain. The multi-year, multi-region training set and explicit cross-domain test protocol strengthen the practical relevance for herbicide research pipelines.

major comments (1)

The central claim attributes consistent outperformance under domain shift to the DINOv2 backbone (species F1 0.87 in-distribution, 0.77 moderate shift, 0.44 extreme shift). However, the manuscript provides no ablation that isolates this contribution from the hierarchical taxonomic inference. No experiment trains DINOv2 with a flat classification head, equips the prior baselines with the same hierarchical post-processing, or quantifies the incremental effect of each component on the reported shift metrics. If the hierarchy alone explains most of the gap versus the 0.24/0.14 baselines, the attribution to the foundation model does not hold. This is load-bearing for the abstract and results claims.
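
The requested ablation amounts to a 2x2 design: backbone (baseline or DINOv2) crossed with head (flat or hierarchical). As a sketch of how the four resulting F1 scores could be decomposed, the function below computes the average backbone effect, the average hierarchy effect, and their interaction. The function name and the cell keys are hypothetical. The usage values are placeholders, since the manuscript does not report the two mixed configurations.

```python
def factorial_effects(f1):
    """Decompose a 2x2 ablation into main effects and an interaction.

    f1 maps (backbone, head) -> F1, with backbone in {"baseline", "dinov2"}
    and head in {"flat", "hier"}.
    """
    b_flat = f1[("baseline", "flat")]
    b_hier = f1[("baseline", "hier")]
    d_flat = f1[("dinov2", "flat")]
    d_hier = f1[("dinov2", "hier")]
    # Gain from swapping the backbone, averaged over both heads.
    backbone = ((d_flat - b_flat) + (d_hier - b_hier)) / 2
    # Gain from adding the hierarchy, averaged over both backbones.
    hierarchy = ((b_hier - b_flat) + (d_hier - d_flat)) / 2
    # How much more (or less) the hierarchy helps when paired with DINOv2.
    interaction = (d_hier - d_flat) - (b_hier - b_flat)
    return {"backbone": backbone, "hierarchy": hierarchy, "interaction": interaction}


# Placeholder scores for illustration only; not results from the paper.
placeholder = {
    ("baseline", "flat"): 0.20,
    ("baseline", "hier"): 0.30,
    ("dinov2", "flat"): 0.60,
    ("dinov2", "hier"): 0.70,
}
effects = factorial_effects(placeholder)
```

If the backbone effect dominates, the attribution to the foundation model stands. If the hierarchy effect is comparable or larger, the concern raised in this comment applies.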

minor comments (2)

Abstract and results sections report point F1 values without error bars, confidence intervals, or statistical significance tests, making it difficult to assess whether the reported gains are robust to training stochasticity or implementation choices.
Training details (optimizer, learning-rate schedule, data augmentation, exact DINOv2 variant and fine-tuning protocol) are not fully specified, which limits reproducibility of the in-distribution and shift results.
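
For the first minor comment, an image-level percentile bootstrap is one common way to attach an interval to a pooled F1 score. The sketch below assumes per-image true-positive, false-positive, and false-negative pixel counts are available for the class of interest. The function names, the resampling unit, and the toy counts are assumptions, not details from the manuscript.

```python
import random


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(per_image_counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for pooled F1, resampling whole images.

    per_image_counts is a list of (tp, fp, fn) tuples, one per image.
    """
    rng = random.Random(seed)
    n = len(per_image_counts)
    scores = []
    for _ in range(n_boot):
        # Draw n images with replacement, then pool their counts.
        sample = [per_image_counts[rng.randrange(n)] for _ in range(n)]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        scores.append(f1_from_counts(tp, fp, fn))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy per-image counts for illustration only; not data from the paper.
counts = [(80, 10, 10), (50, 30, 20), (90, 5, 5), (40, 20, 40)]
point = f1_from_counts(
    sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts)
)
lower, upper = bootstrap_f1_ci(counts)
```

Resampling whole images rather than pixels respects the fact that pixels within one image are strongly correlated. Variation across training seeds, which the comment also raises, would need repeated training runs and is not captured here.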

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comment raises a valid point about the need for component-wise ablations to support attribution of gains to the DINOv2 backbone. We address this below and will revise the manuscript accordingly.

Referee: The central claim attributes consistent outperformance under domain shift to the DINOv2 backbone (species F1 0.87 in-distribution, 0.77 moderate shift, 0.44 extreme shift). However, the manuscript provides no ablation that isolates this contribution from the hierarchical taxonomic inference. No experiment trains DINOv2 with a flat classification head, equips the prior baselines with the same hierarchical post-processing, or quantifies the incremental effect of each component on the reported shift metrics. If the hierarchy alone explains most of the gap versus the 0.24/0.14 baselines, the attribution to the foundation model does not hold. This is load-bearing for the abstract and results claims.

Authors: We agree that explicit ablations are necessary to rigorously attribute performance gains under domain shift. The manuscript demonstrates that the combined DINOv2 + hierarchical framework outperforms prior baselines (which lack both the foundation-model backbone and hierarchical inference) across all evaluated shifts. However, we acknowledge the absence of controlled experiments that (a) apply DINOv2 features with a flat head or (b) equip the original baselines with the same hierarchical post-processing. In the revised manuscript we will add these ablations, together with a quantitative breakdown of the incremental contribution of each component to species-, family-, and class-level F1 under the temporal, geographic, and extreme sensor shifts. This will strengthen the central claims without altering the reported results.

revision: yes

Circularity Check

0 steps flagged

No circularity: performance metrics derived from independent held-out shift evaluations

full rationale

The paper describes standard supervised training of a DINOv2-based segmentation model on a multi-year German-Spanish dataset (2018-2020) followed by direct evaluation on temporally, geographically, and sensor-shifted test sets (2023 temporal and device shift, transfer to US sites, 2024 drone imagery). No equations, fitted parameters, or self-citations are presented that reduce the reported F1 improvements (e.g., species-level 0.87 in-distribution, 0.77 moderate shift) to quantities defined or optimized on the same shift data. Hierarchical taxonomic inference is described as an additional robustness layer, but its contribution is not claimed via any self-referential derivation or uniqueness theorem. The central results remain externally falsifiable measurements on distinct data partitions, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on transferability of general-purpose DINOv2 features to agricultural imagery and on the validity of a fixed biological taxonomy for hierarchical fallback; no new entities are postulated and no parameters are fitted specifically to the target shift conditions.

axioms (2)

domain assumption: DINOv2 features transfer effectively to fine-grained plant segmentation without domain-specific pretraining. Invoked when the paper states the foundation-model backbone outperforms baselines under shift.
domain assumption: Higher taxonomic levels remain predictable when species-level labels degrade. Used to claim family F1 0.68 and class F1 0.88 on aerial imagery.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "integrates vision foundation models (DINOv2) with hierarchical taxonomic inference to improve robustness across heterogeneous agricultural conditions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

SegRAG augments SAM3 with class-specific point prompts retrieved via DINOv3 features and filtered by ICCD, using TSG at inference to improve open-vocabulary segmentation.
SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

SegRAG is a training-free retrieval-augmented framework that extracts class-specific point prompts from a filtered DINOv3 feature bank to boost SAM3 semantic segmentation performance on standard and agricultural benchmarks.
Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification
cs.CV 2026-04 unverdicted novelty 6.0

Circuit duplication on frozen DINOv3 embeddings raises macro F1 to 0.875 on AQUA20, within 1.4 points of supervised ConvNeXt, with class-specific circuits helping 75% of species.
Label-efficient underwater species classification with logistic regression on frozen foundation model embeddings
cs.CV 2026-03 accept novelty 4.0

Logistic regression on frozen DINOv3 features achieves 88.5% macro F1 on the AQUA20 marine species benchmark, matching end-to-end supervised models with only 6% of the labels.