pith. machine review for the scientific record.

arxiv: 2605.11771 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 Lean theorem links

Revisiting Shadow Detection from a Vision-Language Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: shadow detection · vision-language · semantic reference · dense prediction · global-to-local coupling · parameter-efficient learning · ambiguous scenes

The pith

Shadow detection gains robustness by aligning global image features with shadow-related language embeddings to resolve visual ambiguities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Shadow detection models that depend only on pixel-level visual cues often fail when dark regions could be cast shadows or naturally dark surfaces. This paper argues that adding language as an explicit semantic reference supplies the missing disambiguation signal. The proposed SVL framework aligns the full image representation with text embeddings that describe shadows by training on a scene-level shadow ratio regression task. Global guidance then propagates to dense local predictions through a coupling step and patch-level text constraints. The resulting lightweight model, using a frozen image encoder, achieves strong benchmark results and better handling of ambiguous cases.
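
To make the coupling step concrete: a minimal sketch, under the assumption that the global branch predicts a scalar shadow ratio and the dense branch emits per-pixel logits, of how a global-to-local consistency penalty could be wired. The L1 form and the `coupling_loss` name are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def coupling_loss(patch_logits: torch.Tensor, global_ratio: torch.Tensor) -> torch.Tensor:
    """Tie dense predictions to the image-level shadow-ratio estimate.

    patch_logits: (B, 1, H, W) dense shadow logits from the local branch.
    global_ratio: (B,) shadow ratio predicted from the global representation.
    """
    patch_probs = torch.sigmoid(patch_logits)      # per-pixel shadow probability
    local_ratio = patch_probs.mean(dim=(1, 2, 3))  # fraction of the image called shadow
    return F.l1_loss(local_ratio, global_ratio)    # penalize global-local disagreement

# Toy check: two images whose global branch predicts 25% and 40% shadow coverage.
loss = coupling_loss(torch.randn(2, 1, 32, 32), torch.tensor([0.25, 0.40]))
```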

Core claim

The paper establishes that robust shadow detection requires an explicit semantic reference from language beyond visual cues alone. SVL supplies this reference by aligning the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, then transfers the guidance to dense inference via global-to-local coupling while applying local patch-level text constraints, producing improved performance and robustness under visually ambiguous conditions.

What carries the argument

SVL, the Shadow Vision-Language framework that aligns global image representations with shadow text embeddings via scene-level shadow ratio regression and transfers guidance through global-to-local coupling plus local text constraints.
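
One way to picture that alignment, as a hedged sketch rather than the authors' code: project the frozen encoder's pooled feature into the text-embedding space and read its cosine similarity to a fixed shadow-prompt embedding as the predicted scene-level shadow ratio. The dimensions and the `RatioHead` name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RatioHead(nn.Module):
    """Lightweight projection aligning a global image feature with a shadow prompt."""

    def __init__(self, img_dim: int, txt_dim: int):
        super().__init__()
        self.proj = nn.Linear(img_dim, txt_dim)  # the only trainable piece here

    def forward(self, global_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.proj(global_feat), dim=-1)  # (B, txt_dim)
        t = F.normalize(text_emb, dim=-1)                # (txt_dim,), kept frozen
        return (v @ t).clamp(0.0, 1.0)                   # similarity read as shadow ratio

head = RatioHead(img_dim=768, txt_dim=512)
feat = torch.randn(4, 768)   # stand-in for frozen pooled image features
prompt = torch.randn(512)    # stand-in for a frozen shadow-text embedding
gt_ratio = torch.rand(4)     # ground-truth shadow pixel fraction per image
loss = F.l1_loss(head(feat, prompt), gt_ratio)
```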

If this is right

  • Strong overall performance is achieved across multiple standard shadow detection benchmarks.
  • Robustness increases specifically in visually ambiguous conditions where visual cues alone are unreliable.
  • The design remains parameter-efficient by training less than 1 percent of parameters on top of a frozen image encoder.
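
The efficiency claim is easy to sanity-check in principle. A minimal counting sketch, with a stack of linear layers standing in for the frozen encoder (DINOv3 itself is not loaded here; all sizes are illustrative):

```python
import torch.nn as nn

# Stand-in "frozen encoder": roughly 25M parameters, all gradient-free.
backbone = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(24)])
for p in backbone.parameters():
    p.requires_grad_(False)

# Lightweight trainable head, analogous to the projection/decoding modules.
head = nn.Sequential(nn.Linear(1024, 64), nn.GELU(), nn.Linear(64, 1))

total = sum(p.numel() for m in (backbone, head) for p in m.parameters())
trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
print(f"trainable fraction: {trainable / total:.2%}")  # ~0.26% with these sizes
```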

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same global-to-local coupling pattern could be tested in other dense prediction tasks that involve appearance ambiguities, such as distinguishing reflections from objects.
  • Performance may vary with the choice of text embedding model, suggesting experiments that swap different language encoders while keeping the regression objective fixed.
  • Reducing reliance on purely visual supervision through language references could lower annotation costs for training shadow detectors in new environments.

Load-bearing premise

Shadow-related text embeddings supply a reliable semantic reference capable of disambiguating cast shadows from intrinsically dark surfaces when visual evidence alone is insufficient.

What would settle it

A controlled evaluation on images where dark surfaces are visually similar to shadows but semantically distinct in text embeddings, measuring whether SVL retains its accuracy or drops to the level of a vision-only baseline.
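
A sketch of how that evaluation could be scored, assuming binary masks and the balanced error rate (BER) conventional in shadow detection; the harness, threshold, and variable names are illustrative, not from the paper.

```python
import torch

def ber(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Balanced error rate (%) between a binary prediction and a ground-truth mask."""
    pred, gt = pred.bool(), gt.bool()
    pos = gt.sum().clamp(min=1).float()
    neg = (~gt).sum().clamp(min=1).float()
    shadow_err = 1.0 - (pred & gt).sum().float() / pos        # missed shadow pixels
    nonshadow_err = 1.0 - (~pred & ~gt).sum().float() / neg   # dark surfaces called shadow
    return 50.0 * (shadow_err + nonshadow_err).item()

def evaluate(model, ambiguous_set):
    """Mean BER over curated (image, mask) pairs with shadow-like dark surfaces."""
    scores = [ber(model(img) > 0.5, mask) for img, mask in ambiguous_set]
    return sum(scores) / len(scores)

# The premise fails if evaluate(svl, hard_cases) is close to
# evaluate(vision_only_baseline, hard_cases).
```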

Figures

Figures reproduced from arXiv: 2605.11771 by Hao Feng, Houqiang Li, Wengang Zhou, Yonghui Wang.

Figure 1: Progressive ambiguity in shadow detection. (a) Appearance-driven: …
Figure 2: Overall architecture of SVL. A frozen vision encoder extracts multi-level …
Figure 3: Overview of the proposed vision–semantic consistency learning.
Figure 4: Qualitative comparison with representative shadow detection methods. SwinS and SAdapter are short for SwinShadow and ShadowAdapter, respectively.
Figure 5: Qualitative comparison with representative shadow detection methods on more challenging cases with shadow–non-shadow ambiguity. From top to …
Figure 6: Sensitivity analysis of the low-level skip feature. We report …
Figure 7: Failure cases of SVL on thin or entangled shadow patterns.
(Captions only; images not reproduced.)
Original abstract

Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes SVL, a Shadow Vision-Language framework for shadow detection that treats language embeddings as an explicit semantic reference to disambiguate cast shadows from intrinsically dark surfaces. It aligns a global image representation (from frozen DINOv3) with shadow-related text embeddings via a scene-level shadow ratio regression objective, transfers guidance to dense predictions through global-to-local coupling, and adds local patch-level text constraints, all while training only lightweight modules (<1% parameters) and reporting strong results on standard benchmarks plus dedicated hard-case evaluations.

Significance. If the central claims hold, the work offers a parameter-efficient route to injecting semantic priors from language into dense vision tasks, with potential for improved robustness precisely where visual cues are ambiguous. The frozen-encoder design and explicit global-to-local transfer mechanism are practical strengths that could generalize beyond shadow detection.

major comments (1)
  1. [SVL framework (abstract and §3)] The scene-level shadow ratio regression objective (described in the abstract and framework overview) aligns the global image representation to text embeddings by regressing a scalar ratio. This formulation can be satisfied by a direct mapping from DINOv3 features to the ratio without the text embeddings necessarily supplying independent semantic distinctions between cast shadows and dark surfaces; the subsequent global-to-local coupling and local patch constraints then inherit this ambiguity. An ablation that isolates the contribution of the text embeddings (e.g., replacing them with a learned scalar target) is required to substantiate the claim that language provides semantic disambiguation beyond additional global supervision.
minor comments (1)
  1. [Abstract] The abstract states strong benchmark performance and improved robustness under ambiguous conditions but omits any numerical metrics, ablation tables, or error breakdowns; including at least headline numbers and a brief ablation summary would strengthen the presentation.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the SVL framework. The concern that the scene-level shadow ratio regression might be satisfiable without the text embeddings contributing independent semantic distinctions is valid and merits direct verification. We will add the requested ablation in the revised manuscript.

Point-by-point responses
  1. Referee: [SVL framework (abstract and §3)] The scene-level shadow ratio regression objective (described in the abstract and framework overview) aligns the global image representation to text embeddings by regressing a scalar ratio. This formulation can be satisfied by a direct mapping from DINOv3 features to the ratio without the text embeddings necessarily supplying independent semantic distinctions between cast shadows and dark surfaces; the subsequent global-to-local coupling and local patch constraints then inherit this ambiguity. An ablation that isolates the contribution of the text embeddings (e.g., replacing them with a learned scalar target) is required to substantiate the claim that language provides semantic disambiguation beyond additional global supervision.

    Authors: We agree that an explicit ablation is needed to isolate whether the fixed language embeddings supply semantic distinctions beyond what a scalar regression target could achieve. In the revised manuscript we will add this experiment: we replace the text embeddings with a learned scalar target (while keeping the same regression loss and global-to-local coupling) and report performance on both standard benchmarks and the hard-case subset. We expect the language version to retain an advantage on ambiguous cases because the text embeddings are derived from semantic prompts that encode distinctions such as “cast shadow” versus “intrinsically dark surface,” providing a fixed directional prior in embedding space that a scalar cannot. The global-to-local coupling then transfers this prior rather than a purely numeric signal. We will also clarify in §3 that the alignment objective operates in the joint vision-language space rather than as a simple scalar predictor. revision: yes
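
A sketch of the two arms of that ablation as described (assumed design, not the authors' code): variant A keeps the fixed text embedding as the regression reference, while variant B removes language and maps the global feature straight to a scalar. Comparable hard-case results from variant B would undercut the semantic-disambiguation claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAlignedRatio(nn.Module):
    """Variant A: ratio read off similarity to a frozen shadow-text embedding."""

    def __init__(self, img_dim: int, text_emb: torch.Tensor):
        super().__init__()
        self.proj = nn.Linear(img_dim, text_emb.numel())
        self.register_buffer("text_emb", F.normalize(text_emb, dim=-1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.proj(feat), dim=-1)
        return (v @ self.text_emb).clamp(0.0, 1.0)

class ScalarRatio(nn.Module):
    """Variant B: no language reference, direct regression to the scalar target."""

    def __init__(self, img_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(img_dim, 64), nn.GELU(), nn.Linear(64, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(feat)).squeeze(-1)

feat = torch.randn(4, 768)
ratio_a = TextAlignedRatio(768, torch.randn(512))(feat)  # language-referenced
ratio_b = ScalarRatio(768)(feat)                         # scalar-only control
```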

Circularity Check

0 steps flagged

SVL derivation is self-contained; no load-bearing steps reduce to fitted inputs or self-citations by construction

Full rationale

The paper introduces SVL as a new framework that aligns global image features to shadow-related text embeddings via a scene-level shadow ratio regression objective, then transfers guidance through global-to-local coupling and applies local patch-level text constraints. These components are defined directly in the manuscript without reference to prior fitted parameters from the same authors or equations that equate the claimed outputs to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior work appear in the derivation. The shadow ratio serves as standard supervision from annotations rather than a renamed prediction, and the text embeddings function as an external semantic reference rather than a tautological target. Experiments on benchmarks provide external validation, keeping the central claims independent of internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that language embeddings provide useful semantic disambiguation for visual shadow detection, plus standard assumptions about the quality of frozen DINOv3 features and the validity of regression-based alignment objectives. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Shadow-related text embeddings supply semantic information that can reliably disambiguate cast shadows from intrinsically dark surfaces when visual cues are ambiguous.
    Invoked as the core justification for using language as an explicit reference in the SVL framework.

pith-pipeline@v0.9.0 · 5550 in / 1306 out tokens · 27966 ms · 2026-05-13T05:56:50.643818+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

