pith. machine review for the scientific record.

arxiv: 2605.11771 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 Lean theorem links

Revisiting Shadow Detection from a Vision-Language Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: shadow detection · vision-language · semantic reference · dense prediction · global-to-local coupling · parameter-efficient learning · ambiguous scenes

The pith

Shadow detection gains robustness by aligning global image features with shadow-related language embeddings to resolve visual ambiguities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Shadow detection models that depend only on pixel-level visual cues often fail when dark regions could be cast shadows or naturally dark surfaces. This paper argues that adding language as an explicit semantic reference supplies the missing disambiguation signal. The proposed SVL framework aligns the full image representation with text embeddings that describe shadows by training on a scene-level shadow ratio regression task. Global guidance then propagates to dense local predictions through a coupling step and patch-level text constraints. The resulting lightweight model, using a frozen image encoder, achieves strong benchmark results and better handling of ambiguous cases.
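
To make the coupling step concrete: a minimal sketch, under the assumption that the global branch predicts a scalar shadow ratio and the dense branch emits per-pixel logits, of how a global-to-local consistency penalty could be wired. The L1 form and the `coupling_loss` name are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def coupling_loss(patch_logits: torch.Tensor, global_ratio: torch.Tensor) -> torch.Tensor:
    """Tie dense predictions to the image-level shadow-ratio estimate.

    patch_logits: (B, 1, H, W) dense shadow logits from the local branch.
    global_ratio: (B,) shadow ratio predicted from the global representation.
    """
    patch_probs = torch.sigmoid(patch_logits)      # per-pixel shadow probability
    local_ratio = patch_probs.mean(dim=(1, 2, 3))  # fraction of the image called shadow
    return F.l1_loss(local_ratio, global_ratio)    # penalize global-local disagreement

# Toy check: two images whose global branch predicts 25% and 40% shadow coverage.
loss = coupling_loss(torch.randn(2, 1, 32, 32), torch.tensor([0.25, 0.40]))
```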

Core claim

The paper establishes that robust shadow detection requires an explicit semantic reference from language beyond visual cues alone. SVL supplies this reference by aligning the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, then transfers the guidance to dense inference via global-to-local coupling while applying local patch-level text constraints, producing improved performance and robustness under visually ambiguous conditions.

What carries the argument

SVL, the Shadow Vision-Language framework that aligns global image representations with shadow text embeddings via scene-level shadow ratio regression and transfers guidance through global-to-local coupling plus local text constraints.
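
One way to picture that alignment, as a hedged sketch rather than the authors' code: project the frozen encoder's pooled feature into the text-embedding space and read its cosine similarity to a fixed shadow-prompt embedding as the predicted scene-level shadow ratio. The dimensions and the `RatioHead` name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RatioHead(nn.Module):
    """Lightweight projection aligning a global image feature with a shadow prompt."""

    def __init__(self, img_dim: int, txt_dim: int):
        super().__init__()
        self.proj = nn.Linear(img_dim, txt_dim)  # the only trainable piece here

    def forward(self, global_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.proj(global_feat), dim=-1)  # (B, txt_dim)
        t = F.normalize(text_emb, dim=-1)                # (txt_dim,), kept frozen
        return (v @ t).clamp(0.0, 1.0)                   # similarity read as shadow ratio

head = RatioHead(img_dim=768, txt_dim=512)
feat = torch.randn(4, 768)   # stand-in for frozen pooled image features
prompt = torch.randn(512)    # stand-in for a frozen shadow-text embedding
gt_ratio = torch.rand(4)     # ground-truth shadow pixel fraction per image
loss = F.l1_loss(head(feat, prompt), gt_ratio)
```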

If this is right

  • Strong overall performance is achieved across multiple standard shadow detection benchmarks.
  • Robustness increases specifically in visually ambiguous conditions where visual cues alone are unreliable.
  • The design remains parameter-efficient by training less than 1 percent of parameters on top of a frozen image encoder.
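
The efficiency claim is easy to sanity-check in principle. A minimal counting sketch, with a stack of linear layers standing in for the frozen encoder (DINOv3 itself is not loaded here; all sizes are illustrative):

```python
import torch.nn as nn

# Stand-in "frozen encoder": roughly 25M parameters, all gradient-free.
backbone = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(24)])
for p in backbone.parameters():
    p.requires_grad_(False)

# Lightweight trainable head, analogous to the projection/decoding modules.
head = nn.Sequential(nn.Linear(1024, 64), nn.GELU(), nn.Linear(64, 1))

total = sum(p.numel() for m in (backbone, head) for p in m.parameters())
trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
print(f"trainable fraction: {trainable / total:.2%}")  # ~0.26% with these sizes
```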

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same global-to-local coupling pattern could be tested in other dense prediction tasks that involve appearance ambiguities, such as distinguishing reflections from objects.
  • Performance may vary with the choice of text embedding model, suggesting experiments that swap different language encoders while keeping the regression objective fixed.
  • Reducing reliance on purely visual supervision through language references could lower annotation costs for training shadow detectors in new environments.

Load-bearing premise

Shadow-related text embeddings supply a reliable semantic reference capable of disambiguating cast shadows from intrinsically dark surfaces when visual evidence alone is insufficient.

What would settle it

A controlled evaluation on images where dark surfaces are visually similar to shadows but semantically distinct in text embeddings, measuring whether SVL retains its accuracy or drops to the level of a vision-only baseline.
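
A sketch of how that evaluation could be scored, assuming binary masks and the balanced error rate (BER) conventional in shadow detection; the harness, threshold, and variable names are illustrative, not from the paper.

```python
import torch

def ber(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Balanced error rate (%) between a binary prediction and a ground-truth mask."""
    pred, gt = pred.bool(), gt.bool()
    pos = gt.sum().clamp(min=1).float()
    neg = (~gt).sum().clamp(min=1).float()
    shadow_err = 1.0 - (pred & gt).sum().float() / pos        # missed shadow pixels
    nonshadow_err = 1.0 - (~pred & ~gt).sum().float() / neg   # dark surfaces called shadow
    return 50.0 * (shadow_err + nonshadow_err).item()

def evaluate(model, ambiguous_set):
    """Mean BER over curated (image, mask) pairs with shadow-like dark surfaces."""
    scores = [ber(model(img) > 0.5, mask) for img, mask in ambiguous_set]
    return sum(scores) / len(scores)

# The premise fails if evaluate(svl, hard_cases) is close to
# evaluate(vision_only_baseline, hard_cases).
```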

Figures

Figures reproduced from arXiv: 2605.11771 by Hao Feng, Houqiang Li, Wengang Zhou, Yonghui Wang.

Figure 1: Progressive ambiguity in shadow detection. (a) Appearance-driven: …
Figure 2: Overall architecture of SVL. A frozen vision encoder extracts multi-level …
Figure 3: Overview of the proposed vision–semantic consistency learning.
Figure 4: Qualitative comparison with representative shadow detection methods. SwinS and SAdapter are short for SwinShadow and ShadowAdapter, respectively.
Figure 5: Qualitative comparison with representative shadow detection methods on more challenging cases with shadow–non-shadow ambiguity. From top to …
Figure 6: Sensitivity analysis of the low-level skip feature. We report …
Figure 7: Failure cases of SVL on thin or entangled shadow patterns.
(Captions only; images not reproduced.)
Original abstract

Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes SVL, a Shadow Vision-Language framework for shadow detection that treats language embeddings as an explicit semantic reference to disambiguate cast shadows from intrinsically dark surfaces. It aligns a global image representation (from frozen DINOv3) with shadow-related text embeddings via a scene-level shadow ratio regression objective, transfers guidance to dense predictions through global-to-local coupling, and adds local patch-level text constraints, all while training only lightweight modules (<1% parameters) and reporting strong results on standard benchmarks plus dedicated hard-case evaluations.

Significance. If the central claims hold, the work offers a parameter-efficient route to injecting semantic priors from language into dense vision tasks, with potential for improved robustness precisely where visual cues are ambiguous. The frozen-encoder design and explicit global-to-local transfer mechanism are practical strengths that could generalize beyond shadow detection.

major comments (1)
  1. [SVL framework (abstract and §3)] The scene-level shadow ratio regression objective (described in the abstract and framework overview) aligns the global image representation to text embeddings by regressing a scalar ratio. This formulation can be satisfied by a direct mapping from DINOv3 features to the ratio without the text embeddings necessarily supplying independent semantic distinctions between cast shadows and dark surfaces; the subsequent global-to-local coupling and local patch constraints then inherit this ambiguity. An ablation that isolates the contribution of the text embeddings (e.g., replacing them with a learned scalar target) is required to substantiate the claim that language provides semantic disambiguation beyond additional global supervision.
minor comments (1)
  1. [Abstract] The abstract states strong benchmark performance and improved robustness under ambiguous conditions but omits any numerical metrics, ablation tables, or error breakdowns; including at least headline numbers and a brief ablation summary would strengthen the presentation.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the SVL framework. The concern that the scene-level shadow ratio regression might be satisfiable without the text embeddings contributing independent semantic distinctions is valid and merits direct verification. We will add the requested ablation in the revised manuscript.

Point-by-point responses
  1. Referee: [SVL framework (abstract and §3)] The scene-level shadow ratio regression objective (described in the abstract and framework overview) aligns the global image representation to text embeddings by regressing a scalar ratio. This formulation can be satisfied by a direct mapping from DINOv3 features to the ratio without the text embeddings necessarily supplying independent semantic distinctions between cast shadows and dark surfaces; the subsequent global-to-local coupling and local patch constraints then inherit this ambiguity. An ablation that isolates the contribution of the text embeddings (e.g., replacing them with a learned scalar target) is required to substantiate the claim that language provides semantic disambiguation beyond additional global supervision.

    Authors: We agree that an explicit ablation is needed to isolate whether the fixed language embeddings supply semantic distinctions beyond what a scalar regression target could achieve. In the revised manuscript we will add this experiment: we replace the text embeddings with a learned scalar target (while keeping the same regression loss and global-to-local coupling) and report performance on both standard benchmarks and the hard-case subset. We expect the language version to retain an advantage on ambiguous cases because the text embeddings are derived from semantic prompts that encode distinctions such as “cast shadow” versus “intrinsically dark surface,” providing a fixed directional prior in embedding space that a scalar cannot. The global-to-local coupling then transfers this prior rather than a purely numeric signal. We will also clarify in §3 that the alignment objective operates in the joint vision-language space rather than as a simple scalar predictor. revision: yes
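
A sketch of the two arms of that ablation as described (assumed design, not the authors' code): variant A keeps the fixed text embedding as the regression reference, while variant B removes language and maps the global feature straight to a scalar. Comparable hard-case results from variant B would undercut the semantic-disambiguation claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAlignedRatio(nn.Module):
    """Variant A: ratio read off similarity to a frozen shadow-text embedding."""

    def __init__(self, img_dim: int, text_emb: torch.Tensor):
        super().__init__()
        self.proj = nn.Linear(img_dim, text_emb.numel())
        self.register_buffer("text_emb", F.normalize(text_emb, dim=-1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.proj(feat), dim=-1)
        return (v @ self.text_emb).clamp(0.0, 1.0)

class ScalarRatio(nn.Module):
    """Variant B: no language reference, direct regression to the scalar target."""

    def __init__(self, img_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(img_dim, 64), nn.GELU(), nn.Linear(64, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(feat)).squeeze(-1)

feat = torch.randn(4, 768)
ratio_a = TextAlignedRatio(768, torch.randn(512))(feat)  # language-referenced
ratio_b = ScalarRatio(768)(feat)                         # scalar-only control
```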

Circularity Check

0 steps flagged

SVL derivation is self-contained; no load-bearing steps reduce to fitted inputs or self-citations by construction

Full rationale

The paper introduces SVL as a new framework that aligns global image features to shadow-related text embeddings via a scene-level shadow ratio regression objective, then transfers guidance through global-to-local coupling and applies local patch-level text constraints. These components are defined directly in the manuscript without reference to prior fitted parameters from the same authors or equations that equate the claimed outputs to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior work appear in the derivation. The shadow ratio serves as standard supervision from annotations rather than a renamed prediction, and the text embeddings function as an external semantic reference rather than a tautological target. Experiments on benchmarks provide external validation, keeping the central claims independent of internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that language embeddings provide useful semantic disambiguation for visual shadow detection, plus standard assumptions about the quality of frozen DINOv3 features and the validity of regression-based alignment objectives. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Shadow-related text embeddings supply semantic information that can reliably disambiguate cast shadows from intrinsically dark surfaces when visual cues are ambiguous.
    Invoked as the core justification for using language as an explicit reference in the SVL framework.

pith-pipeline@v0.9.0 · 5550 in / 1306 out tokens · 27966 ms · 2026-05-13T05:56:50.643818+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

