Revisiting Shadow Detection from a Vision-Language Perspective
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3
The pith
Shadow detection gains robustness by aligning global image features with shadow-related language embeddings to resolve visual ambiguities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that robust shadow detection requires an explicit semantic reference from language, beyond visual cues alone. SVL supplies this reference by aligning the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, transferring that guidance to dense inference via global-to-local coupling, and applying local patch-level text constraints. The result is improved performance and robustness under visually ambiguous conditions.
What carries the argument
SVL, the Shadow Vision-Language framework that aligns global image representations with shadow text embeddings via scene-level shadow ratio regression and transfers guidance through global-to-local coupling plus local text constraints.
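The scene-level alignment objective can be sketched in a few lines of plain Python. Everything here is illustrative: the toy embeddings, the cosine-similarity readout, and the squared-error loss are assumptions about the structure of the objective, not the paper's exact formulation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predicted_shadow_ratio(image_emb, shadow_text_emb):
    # Read the scene-level shadow ratio off the similarity between the
    # global image embedding and a shadow-related text embedding,
    # mapping cosine in [-1, 1] to a ratio in [0, 1].
    return 0.5 * (cosine(image_emb, shadow_text_emb) + 1.0)

def ratio_regression_loss(image_emb, shadow_text_emb, true_ratio):
    # Scene-level objective: regress the annotated shadow ratio, which
    # pulls the image embedding toward a language-defined direction.
    pred = predicted_shadow_ratio(image_emb, shadow_text_emb)
    return (pred - true_ratio) ** 2

# Toy usage with hypothetical 3-d embeddings.
text = [1.0, 0.0, 0.0]
scene = [0.9, 0.1, 0.0]
ratio = predicted_shadow_ratio(scene, text)
assert 0.0 <= ratio <= 1.0
```

The point of the sketch is that the supervision signal is a scalar ratio, but the target lives in the joint vision-language space: the gradient moves the image embedding along a text-defined direction.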
If this is right
- Strong overall performance is achieved across multiple standard shadow detection benchmarks.
- Robustness increases specifically in visually ambiguous conditions where visual cues alone are unreliable.
- The design remains parameter-efficient by training less than 1 percent of parameters on top of a frozen image encoder.
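The parameter-efficiency claim in the last bullet is easy to make concrete. The counts below are hypothetical round numbers (a frozen backbone of roughly the scale of a large ViT plus small projection/decoding heads), chosen only to show what "less than 1 percent trainable" means arithmetically.

```python
def trainable_fraction(frozen_params, trainable_params):
    # Fraction of all parameters that are actually updated in training.
    return trainable_params / (frozen_params + trainable_params)

# Hypothetical counts: a frozen image encoder of ~300M parameters and
# lightweight projection/decoding modules of ~2M parameters.
frozen = 300_000_000
heads = 2_000_000
frac = trainable_fraction(frozen, heads)
assert frac < 0.01  # consistent with the paper's "<1% trainable" claim
```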
Where Pith is reading between the lines
- The same global-to-local coupling pattern could be tested in other dense prediction tasks that involve appearance ambiguities, such as distinguishing reflections from objects.
- Performance may vary with the choice of text embedding model, suggesting experiments that swap different language encoders while keeping the regression objective fixed.
- Reducing reliance on purely visual supervision through language references could lower annotation costs for training shadow detectors in new environments.
Load-bearing premise
Shadow-related text embeddings supply a reliable semantic reference capable of disambiguating cast shadows from intrinsically dark surfaces when visual evidence alone is insufficient.
What would settle it
A controlled evaluation on images where dark surfaces are visually similar to shadows but semantically distinct in text embeddings, measuring whether SVL accuracy drops to the level of a vision-only baseline.
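The proposed settling experiment reduces to comparing a segmentation metric on a hard-case subset between the language-guided model and a vision-only baseline. A minimal per-image IoU comparison, with hypothetical binary masks standing in for real predictions, could look like this:

```python
def iou(pred, gt):
    # Intersection-over-union for binary masks given as flat 0/1 lists.
    inter = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    union = sum(1 for p, g in zip(pred, gt) if p == 1 or g == 1)
    return inter / union if union else 1.0

def mean_iou(preds, gts):
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)

# Hypothetical hard-case masks: the experiment "settles it" if the
# language-guided model keeps a margin over the vision-only baseline
# on the confusable subset, rather than collapsing to the same score.
gt = [[1, 1, 0, 0]]
svl_pred = [[1, 1, 0, 0]]
baseline_pred = [[1, 1, 1, 0]]  # over-predicts a dark surface as shadow
assert mean_iou(svl_pred, gt) > mean_iou(baseline_pred, gt)
```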
Original abstract
Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision--language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision--Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than $1\%$ trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.
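One way to read the global-to-local coupling mechanism described in the abstract is as a consistency penalty between the image-level shadow-ratio guidance and the aggregate of patch-level predictions. The sketch below is a guess at that structure under a simple mean-aggregation assumption, not the paper's actual loss.

```python
def coupling_loss(global_ratio, patch_probs):
    # Global-to-local consistency (assumed form): the mean patch-level
    # shadow probability should agree with the image-level shadow ratio.
    local_ratio = sum(patch_probs) / len(patch_probs)
    return (global_ratio - local_ratio) ** 2

# A flat list of 4 patch predictions whose mean matches the global
# ratio incurs no penalty; an all-shadow prediction is penalized.
consistent = coupling_loss(0.5, [1.0, 1.0, 0.0, 0.0])
inconsistent = coupling_loss(0.5, [1.0, 1.0, 1.0, 1.0])
assert consistent == 0.0
assert inconsistent > consistent
```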
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SVL, a Shadow Vision-Language framework for shadow detection that treats language embeddings as an explicit semantic reference to disambiguate cast shadows from intrinsically dark surfaces. It aligns a global image representation (from frozen DINOv3) with shadow-related text embeddings via a scene-level shadow ratio regression objective, transfers guidance to dense predictions through global-to-local coupling, and adds local patch-level text constraints, all while training only lightweight modules (<1% parameters) and reporting strong results on standard benchmarks plus dedicated hard-case evaluations.
Significance. If the central claims hold, the work offers a parameter-efficient route to injecting semantic priors from language into dense vision tasks, with potential for improved robustness precisely where visual cues are ambiguous. The frozen-encoder design and explicit global-to-local transfer mechanism are practical strengths that could generalize beyond shadow detection.
Major comments (1)
- [SVL framework (abstract and §3)] The scene-level shadow ratio regression objective (described in the abstract and framework overview) aligns the global image representation to text embeddings by regressing a scalar ratio. This formulation can be satisfied by a direct mapping from DINOv3 features to the ratio without the text embeddings necessarily supplying independent semantic distinctions between cast shadows and dark surfaces; the subsequent global-to-local coupling and local patch constraints then inherit this ambiguity. An ablation that isolates the contribution of the text embeddings (e.g., replacing them with a learned scalar target) is required to substantiate the claim that language provides semantic disambiguation beyond additional global supervision.
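The ablation the referee asks for contrasts two regression heads. The sketch below shows the shape of the two variants; the cosine readout, sigmoid head, and mixing weights are assumptions for illustration, not the paper's architecture.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def text_guided_ratio(image_emb, text_embs, mix):
    # Variant A (full model, assumed form): the ratio is read out from
    # similarities to fixed shadow-related text embeddings, so language
    # supplies the target directions in embedding space.
    sims = [0.5 * (cosine(image_emb, t) + 1.0) for t in text_embs]
    return sum(w * s for w, s in zip(mix, sims))

def scalar_target_ratio(image_emb, linear_w, linear_b):
    # Variant B (ablation): the ratio is regressed directly from image
    # features by a learned head; no language is involved.
    z = sum(w * x for w, x in zip(linear_w, image_emb)) + linear_b
    return 1.0 / (1.0 + math.exp(-z))  # squash to [0, 1]
```

If variant B matches variant A on the hard-case subset after training, the text embeddings acted only as extra global supervision; if variant A retains a margin, language contributed genuine semantic disambiguation.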
Minor comments (1)
- [Abstract] The abstract states strong benchmark performance and improved robustness under ambiguous conditions but omits any numerical metrics, ablation tables, or error breakdowns; including at least headline numbers and a brief ablation summary would strengthen the presentation.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the SVL framework. The concern that the scene-level shadow ratio regression might be satisfiable without the text embeddings contributing independent semantic distinctions is valid and merits direct verification. We will add the requested ablation in the revised manuscript.
Point-by-point responses
-
Referee: [SVL framework (abstract and §3)] The scene-level shadow ratio regression objective (described in the abstract and framework overview) aligns the global image representation to text embeddings by regressing a scalar ratio. This formulation can be satisfied by a direct mapping from DINOv3 features to the ratio without the text embeddings necessarily supplying independent semantic distinctions between cast shadows and dark surfaces; the subsequent global-to-local coupling and local patch constraints then inherit this ambiguity. An ablation that isolates the contribution of the text embeddings (e.g., replacing them with a learned scalar target) is required to substantiate the claim that language provides semantic disambiguation beyond additional global supervision.
Authors: We agree that an explicit ablation is needed to isolate whether the fixed language embeddings supply semantic distinctions beyond what a scalar regression target could achieve. In the revised manuscript we will add this experiment: we replace the text embeddings with a learned scalar target (while keeping the same regression loss and global-to-local coupling) and report performance on both standard benchmarks and the hard-case subset. We expect the language version to retain an advantage on ambiguous cases because the text embeddings are derived from semantic prompts that encode distinctions such as “cast shadow” versus “intrinsically dark surface,” providing a fixed directional prior in embedding space that a scalar cannot. The global-to-local coupling then transfers this prior rather than a purely numeric signal. We will also clarify in §3 that the alignment objective operates in the joint vision-language space rather than as a simple scalar predictor.
Revision: yes
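The "fixed directional prior" the authors invoke can be made concrete: two prompt-derived text directions partition the embedding space in a way a single scalar target cannot. The toy 2-d embeddings below stand in for real prompt encodings and are purely illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def disambiguate(patch_emb, shadow_prompt_emb, dark_surface_prompt_emb):
    # Assign a patch to whichever prompt direction it is closer to.
    # This is the directional prior: two fixed text directions carve up
    # the embedding space, which a lone scalar target cannot do.
    s = cosine(patch_emb, shadow_prompt_emb)
    d = cosine(patch_emb, dark_surface_prompt_emb)
    return "cast shadow" if s >= d else "dark surface"

# Toy prompt embeddings (hypothetical encodings of the two phrases).
shadow_dir = [1.0, 0.0]
dark_dir = [0.0, 1.0]
assert disambiguate([0.9, 0.2], shadow_dir, dark_dir) == "cast shadow"
assert disambiguate([0.1, 0.8], shadow_dir, dark_dir) == "dark surface"
```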
Circularity Check
The SVL derivation is self-contained; no load-bearing steps reduce to fitted inputs or self-citations by construction.
Full rationale
The paper introduces SVL as a new framework that aligns global image features to shadow-related text embeddings via a scene-level shadow ratio regression objective, then transfers guidance through global-to-local coupling and applies local patch-level text constraints. These components are defined directly in the manuscript without reference to prior fitted parameters from the same authors or equations that equate the claimed outputs to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior work appear in the derivation. The shadow ratio serves as standard supervision from annotations rather than a renamed prediction, and the text embeddings function as an external semantic reference rather than a tautological target. Experiments on benchmarks provide external validation, keeping the central claims independent of internal reductions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Shadow-related text embeddings supply semantic information that can reliably disambiguate cast shadows from intrinsically dark surfaces when visual cues are ambiguous.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.