pith. machine review for the scientific record.

arxiv: 2605.04593 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: unknown

DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords weakly supervised semantic segmentation · CLIP · diffusion models · class activation maps · dense knowledge enhancement · VCE · TSA

The pith

Diffusion model integration refines CLIP features for superior weakly supervised segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces DiCLIP, a framework that uses a generative diffusion model to improve CLIP's performance in weakly supervised semantic segmentation. Previous approaches that use CLIP for class activation maps suffer from limited spatial awareness in visual features and insufficient semantic coverage in text embeddings. DiCLIP adds Visual Correlation Enhancement, which injects diffusion-derived correlation maps into CLIP's attention for better discrimination, and Text Semantic Augmentation, which maintains a dynamic cache for retrieving visual knowledge to enrich text representations. If successful, this yields more accurate pixel predictions from image-level labels alone, at lower training cost, on benchmarks such as PASCAL VOC and MS COCO.

Core claim

DiCLIP enhances CLIP's dense knowledge across visual and text modalities by leveraging a diffusion model's generative capabilities. The Visual Correlation Enhancement module employs Attention Clustering Refinement to extract diverse correlation maps that bias CLIP's self-attention toward more discriminative distributions, addressing over-smoothing. The Text Semantic Augmentation module uses diffusion to maintain a dynamic key-value cache, transforming CAM generation into a visual knowledge retrieval process that better captures category variability.
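
The summary describes ACR only at the level of "extract diverse correlation maps." As one concrete reading, a minimal sketch of that extraction: per-head diffusion attention maps grouped into B prototypes with plain k-means (the reference list includes k-means-style clustering [52], [54], but the grouping criterion, the head-level granularity, and every name below are illustrative assumptions, not the paper's method):

```python
import torch

def cluster_correlation_maps(sd_attn, num_groups=4, iters=10):
    """Group diffusion self-attention maps into B correlation prototypes.

    sd_attn:    (H, N, N) per-head self-attention maps from one SD U-Net layer
    num_groups: B, the clustering granularity (must be <= H)
    Returns:    (B, N, N) mean correlation map per cluster.
    """
    H, N, _ = sd_attn.shape
    flat = sd_attn.reshape(H, -1)                    # one vector per head
    centers = flat[torch.randperm(H)[:num_groups]]   # random init; k-means++ omitted
    for _ in range(iters):                           # Lloyd iterations
        assign = torch.cdist(flat, centers).argmin(dim=1)
        for b in range(num_groups):
            members = flat[assign == b]
            if len(members):
                centers[b] = members.mean(dim=0)
    return centers.reshape(num_groups, N, N)
```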

What carries the argument

Visual Correlation Enhancement (VCE) with Attention Clustering Refinement (ACR), and Text Semantic Augmentation (TSA) with a dynamic key-value cache; together these transfer spatial consistency and generative power from the diffusion model to CLIP.
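
As a concrete picture of where those correlation maps would act, a minimal sketch of a self-attention step whose logits are blended with a diffusion-derived bias; the blend weight `alpha` echoes the attention weight α the paper's Figure 9 refers to (Eq. 7), but the exact formulation here is an assumption:

```python
import torch
import torch.nn.functional as F

def biased_self_attention(q, k, v, corr_bias, alpha=0.3):
    """CLIP-style self-attention with a diffusion-derived diversity bias.

    q, k, v:   (N, d) patch queries, keys, and values from a CLIP layer
    corr_bias: (N, N) correlation map distilled from the diffusion model
               (e.g., one ACR cluster prototype), scaled like attention logits
    alpha:     blend weight between CLIP's own logits and the diffusion bias
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5        # vanilla CLIP logits
    logits = (1 - alpha) * logits + alpha * corr_bias  # inject the spatial prior
    return F.softmax(logits, dim=-1) @ v


# toy usage: 196 patch tokens of width 768, random stand-ins
q = k = v = torch.randn(196, 768)
out = biased_self_attention(q, k, v, torch.randn(196, 196))
```

Applying this over the last few CLIP layers, each pass reusing the refined features, is what the recursion in the core claim would amount to.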

If this is right

  • Achieves higher segmentation accuracy than previous state-of-the-art methods on PASCAL VOC and MS COCO datasets.
  • Reduces the training costs associated with weakly supervised semantic segmentation.
  • Improves the quality of class activation maps by mitigating over-smoothing in attention mechanisms.
  • Enables a shift from simple patch-text matching to a more robust visual knowledge retrieval paradigm for dense predictions (sketched below).
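
A minimal sketch of what that retrieval step could look like, in the spirit of the cache classifiers the paper cites (e.g., Tip-Adapter [27]); the temperature, the blend weight `beta`, and all names here are illustrative assumptions rather than the paper's formulation:

```python
import torch
import torch.nn.functional as F

def cam_by_retrieval(patch_feats, cache_keys, cache_values, text_embeds,
                     beta=1.0, temp=0.07):
    """Per-patch class scores: patch-text matching plus cache retrieval.

    patch_feats:  (N, d) L2-normalized CLIP patch features
    cache_keys:   (M, d) L2-normalized diffusion-generated visual features
    cache_values: (M, C) one-hot class labels for the cached features
    text_embeds:  (C, d) L2-normalized CLIP text embeddings
    """
    text_logits = patch_feats @ text_embeds.t()         # classic CLIP-style CAM
    affinity = patch_feats @ cache_keys.t()             # retrieve visual knowledge
    retrieved = F.softmax(affinity / temp, dim=-1) @ cache_values
    return text_logits + beta * retrieved               # (N, C) scores -> CAMs
```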

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method's reliance on diffusion models suggests potential applicability to other dense prediction tasks where vision-language models lack spatial precision.
  • Extending the dynamic cache idea could support online adaptation to new visual categories without full retraining.
  • Future work might test whether similar enhancements work with alternative generative models to reduce dependency on diffusion specifically.

Load-bearing premise

The spatial consistency and generative abilities of the diffusion model can be effectively transferred to CLIP's visual and text features without causing new artifacts or instability in the segmentation process.

What would settle it

Evaluating the full DiCLIP pipeline on the PASCAL VOC validation set, checking whether the mean intersection over union (mIoU) surpasses current leading methods, and measuring whether GPU-hours for training actually fall.
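
The mIoU half of that test is mechanical once predictions exist; for reference, the standard confusion-matrix form of the metric (a generic sketch, not the paper's evaluation code):

```python
import numpy as np

def mean_iou(preds, gts, num_classes=21, ignore_index=255):
    """Standard confusion-matrix mIoU, as reported on PASCAL VOC (21 classes)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):          # one (H, W) label-map pair per image
        mask = gt != ignore_index
        conf += np.bincount(
            num_classes * gt[mask].astype(np.int64) + pred[mask].astype(np.int64),
            minlength=num_classes ** 2,
        ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.where(union > 0, union, np.nan)   # skip absent classes
    return float(np.nanmean(iou))
```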

Figures

Figures reproduced from arXiv: 2605.04593 by Kexue Fu, Pengfei Song, Shuo Wang, Yucong Meng, Zhijian Song, Zhiwei Yang.

Figure 1: Motivation of DiCLIP. (a) Previous WSSS methods solely rely on …
Figure 2: Framework of DiCLIP. Our method leverages Stable Diffusion (SD) to enhance CLIP's dense capability through offline and online phases. …
Figure 3: Attention maps from CLIP and Stable Diffusion (SD). …
Figure 4: Illustration of the Attention Clustering Refinement (ACR) module.
Figure 5: Qualitative segmentation comparisons on PASCAL VOC 2012. DiCLIP is compared with the recent CLIP-based state-of-the-art method WeCLIP [25] and two other advanced approaches, SeCo [36] and MoRe [73]. Small and off-center objects are highlighted with yellow rectangles, and failure cases are shown in the last two columns. …
Figure 6: Qualitative segmentation comparisons on MS COCO 2014 with WeCLIP [25], SeCo [36], and MoRe [73]. Small and off-center objects are highlighted with yellow rectangles, and failure cases are shown in the last two columns. Compared to other methods, DiCLIP produces more precise segmentation results.
Figure 7: CAM visualizations on the VOC train set, validating the efficacy of the key components. (a) Image. (b–e) Qualitative ablation of the key components. (e–h) Comparisons between (e) DiCLIP and recent CLIP-based counterparts: (f) WeCLIP [24], (g) CLIP-ES [40], and (h) MaskCLIP [81]. (i) Ground-truth mask.
Figure 9: The effect of different α values on CLIP's attention distribution, illustrating how SD's attention enhances CLIP's attention. …
Figure 10: Confusion ratio trends under increasing class co-occurrence on the VOC val set. DiCLIP maintains better performance in co-occurrence scenarios. …
Figure 11: Visualization of feature representations. Three representative cases (a–c) and one small, off-center case (d). In each case the upper row is the vanilla CLIP baseline and the lower row is DiCLIP; t-SNE [83], queried attention maps, and affinity maps are used for illustration. …
Figure 13: Comparison between ACR applied to CLIP's attention and to SD's attention, showing that SD's localized attention enables more effective clustering. …
Figure 14: t-SNE visualization of cache feature distributions for synthetic data from SD v2.1 and v1.5 versus real VOC images, showing that the synthetic distributions lie close to the real data. …
Original abstract

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods solely adopt CLIP's vision-language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP's dense knowledge across two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion's reliable spatial consistency to mitigate the over-smoothing issue in CLIP's attention. It designs the Attention Clustering Refinement (ACR) module to reliably extract diverse correlation maps from the diffusion model. The correlation maps act as a diversity bias for CLIP's self-attention, recursively pushing its visual features towards a more discriminative dense distribution. To augment the semantics of text embeddings, our TSA module argues that a single text modality is insufficient to encompass the variability of visual categories. Thus, we leverage diffusion's generative power to maintain a dynamic key-value cache model, shifting CAM generation from a patch-text matching mechanism to a novel visual knowledge retrieval paradigm. With these enhancements, DiCLIP not only outperforms state-of-the-art methods on PASCAL VOC and MS COCO but also significantly reduces training costs. Code is publicly available at https://github.com/zwyang6/DiCLIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DiCLIP, a WSSS framework that augments CLIP with a frozen diffusion model via two new modules: Visual Correlation Enhancement (VCE) containing an Attention Clustering Refinement (ACR) submodule that injects diffusion-derived correlation maps as a diversity bias into CLIP self-attention, and Text Semantic Augmentation (TSA) that maintains a dynamic key-value cache of diffusion-generated visual features to shift CAM generation from patch-text matching to visual knowledge retrieval. The central claim is that these enhancements yield SOTA mIoU on PASCAL VOC and MS COCO while also lowering training costs relative to prior CLIP-based WSSS methods.

Significance. If the performance and cost claims are substantiated, the work would demonstrate a practical way to transfer spatial consistency and generative priors from diffusion models into CLIP-based dense prediction without full fine-tuning, potentially improving efficiency in the image-level supervision regime. Public code release supports reproducibility.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Method): The headline claim that DiCLIP 'significantly reduces training costs' is load-bearing for the contribution yet is unsupported by any quantitative evidence. The VCE/ACR recursion, TSA dynamic cache, and repeated diffusion forward passes introduce non-trivial overhead even with a frozen backbone; without a table reporting wall-clock time, GPU-hours, or FLOPs versus the cited CLIP-CAM baselines (e.g., §4.2), it is impossible to verify whether the net training cost is lower.
  2. [§4] §4 (Experiments): The outperformance claim on PASCAL VOC and MS COCO is central but the abstract and method description provide no dataset statistics, number of runs, error bars, or ablation isolating the contribution of ACR versus the dynamic cache. A single table of final mIoU numbers is insufficient to establish that the gains are robust and not due to hyper-parameter tuning or favorable splits.
  3. [§3.2] §3.2 (ACR module): The recursive refinement of CLIP self-attention by diffusion correlation maps is described as 'pushing visual features towards a more discriminative dense distribution,' but no equation or convergence analysis is given for the recursion depth or the weighting between CLIP attention and the diffusion bias; without this, it is unclear whether the procedure is stable or merely adds another tunable hyper-parameter.
minor comments (2)
  1. [§3.3] Notation for the dynamic key-value cache in TSA is introduced without a formal definition or update rule; a small diagram or pseudocode would clarify the retrieval step.
  2. [§2] The paper cites several recent WSSS methods but does not discuss why diffusion was chosen over other generative priors (e.g., GANs or VAEs) that might incur lower inference cost.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity, rigor, and substantiation of our claims.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Method): The headline claim that DiCLIP 'significantly reduces training costs' is load-bearing for the contribution yet is unsupported by any quantitative evidence. The VCE/ACR recursion, TSA dynamic cache, and repeated diffusion forward passes introduce non-trivial overhead even with a frozen backbone; without a table reporting wall-clock time, GPU-hours, or FLOPs versus the cited CLIP-CAM baselines (e.g., §4.2), it is impossible to verify whether the net training cost is lower.

    Authors: We agree that the claim requires quantitative backing, which is currently missing. Although the diffusion model remains frozen and CLIP is not fine-tuned (reducing overall optimization cost relative to prior methods that train additional heads or adapters), the added modules incur overhead. We will insert a new table in §4 reporting wall-clock training time, peak GPU memory, and approximate FLOPs for DiCLIP versus the CLIP-CAM baselines cited in §4.2. This will allow direct verification of the net cost reduction. revision: yes

  2. Referee: [§4] §4 (Experiments): The outperformance claim on PASCAL VOC and MS COCO is central but the abstract and method description provide no dataset statistics, number of runs, error bars, or ablation isolating the contribution of ACR versus the dynamic cache. A single table of final mIoU numbers is insufficient to establish that the gains are robust and not due to hyper-parameter tuning or favorable splits.

    Authors: We acknowledge the need for greater statistical detail. The current §4 contains component ablations, yet they do not report multiple random seeds, standard deviations, or fully isolate ACR from the TSA cache. We will expand the experimental section to include: (i) explicit dataset statistics, (ii) mean mIoU and standard deviation over at least three independent runs, and (iii) additional ablations that separately disable ACR and the dynamic cache while keeping all other factors fixed. These additions will demonstrate robustness beyond a single table of final scores. revision: yes

  3. Referee: [§3.2] §3.2 (ACR module): The recursive refinement of CLIP self-attention by diffusion correlation maps is described as 'pushing visual features towards a more discriminative dense distribution,' but no equation or convergence analysis is given for the recursion depth or the weighting between CLIP attention and the diffusion bias; without this, it is unclear whether the procedure is stable or merely adds another tunable hyper-parameter.

    Authors: We thank the referee for highlighting this omission. The ACR recursion is presented descriptively but lacks a formal update rule. We will add the precise mathematical formulation of the refined self-attention (including the weighting coefficient between the original CLIP attention and the diffusion-derived bias) together with the stopping criterion for recursion depth. We will also include an ablation table showing mIoU sensitivity to recursion depth (1–4 iterations) and discuss empirical stability; the diffusion maps are fixed and therefore do not introduce new trainable parameters beyond the existing weighting scalar. revision: yes
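
For concreteness, one plausible shape for the promised update rule, in our notation rather than the paper's (α the blend weight, C_SD the fixed diffusion correlation bias, T the recursion depth):

```latex
\tilde{A}^{(t+1)} = \operatorname{softmax}\!\Big( (1-\alpha)\, \frac{Q^{(t)} {K^{(t)}}^{\top}}{\sqrt{d}} + \alpha\, C_{\mathrm{SD}} \Big), \qquad t = 0, \dots, T-1 .
```

Since C_SD is fixed and α ∈ [0, 1], each step is a convex blend of bounded logits, which is the kind of structure a stability argument, or the promised ablation over depths 1–4, could attach to.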

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes DiCLIP as a framework that introduces VCE (with ACR module) and TSA (with dynamic key-value cache) modules to leverage diffusion model properties for enhancing CLIP's dense knowledge in WSSS. No equations, mathematical derivations, or self-referential definitions appear in the abstract or method summary that would equate any claimed prediction or result to its inputs by construction. Performance gains and cost reductions are presented as empirical outcomes of the proposed enhancements rather than tautological or fitted-by-design quantities. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are evident. The derivation chain remains self-contained against external benchmarks and properties of the base models.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

The approach rests on the domain assumption that diffusion models possess reliable spatial consistency and generative power that can be extracted and transferred to CLIP without new biases. New modules (VCE, TSA, ACR) and a dynamic cache are introduced as the core additions.

axioms (2)
  • domain assumption CLIP has inherently limited dense knowledge across visual and text modalities
    Stated directly in the abstract as the motivation for enhancement
  • domain assumption Diffusion model provides reliable spatial consistency and generative power usable for dense prediction
    Invoked to justify VCE and TSA modules
invented entities (4)
  • Visual Correlation Enhancement (VCE) module no independent evidence
    purpose: Mitigate over-smoothing in CLIP attention using diffusion correlation maps
    New module proposed to extract and apply diversity bias
  • Text Semantic Augmentation (TSA) module no independent evidence
    purpose: Augment text embeddings via dynamic key-value cache from diffusion
    New module to shift from patch-text matching to visual knowledge retrieval
  • Attention Clustering Refinement (ACR) module no independent evidence
    purpose: Reliably extract diverse correlation maps from diffusion model
    Component inside VCE for processing diffusion outputs
  • dynamic key-value cache model no independent evidence
    purpose: Maintain variability of visual categories for text semantics
    New paradigm for CAM generation

pith-pipeline@v0.9.0 · 5633 in / 1571 out tokens · 47637 ms · 2026-05-08T18:29:00.050598+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 6 canonical work pages · 1 internal anchor

[1] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015, pp. 3431–3440.
[2] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," NeurIPS, vol. 34, pp. 12077–12090, 2021.
[3] Z. Xu, F. Tang, Z. Chen, Z. Zhou, W. Wu, Y. Yang, Y. Liang, J. Jiang, X. Cai, and J. Su, "Polyp-Mamba: Polyp segmentation with visual Mamba," in MICCAI. Springer, 2024.
[4] A. Yu, K. Gao, X. You, Y. Zhong, Y. Su, B. Liu, and C. Qiu, "Rethinking semantic segmentation with multi-grained logical prototype," IEEE Transactions on Image Processing, 2025.
[5] G. Xu, W. Jia, T. Wu, L. Chen, and G. Gao, "HAFormer: Unleashing the power of hierarchy-aware features for lightweight semantic segmentation," IEEE Transactions on Image Processing, 2024.
[6] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, "What's the point: Semantic segmentation with point supervision," in ECCV. Springer, 2016, pp. 549–565.
[7] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, "ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation," in CVPR, 2016, pp. 3159–3167.
[8] P. Vernaza and M. Chandraker, "Learning random-walk label propagation for weakly-supervised semantic segmentation," in CVPR, 2017, pp. 7158–7166.
[9] J. Dai, K. He, and J. Sun, "BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation," in ICCV, 2015, pp. 1635–1643.
[10] J. Lee, J. Yi, C. Shin, and S. Yoon, "BBAM: Bounding box attribution map for weakly supervised semantic and instance segmentation," in CVPR, 2021, pp. 2643–2652.
[11] Z. Yang, Y. Meng, K. Fu, S. Wang, and Z. Song, "MoRe: Class patch attention needs regularization for weakly supervised semantic segmentation," arXiv preprint arXiv:2412.11076, 2024.
[12] J. Ahn and S. Kwak, "Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation," in CVPR, 2018, pp. 4981–4990.
[13] L. Ru, Y. Zhan, B. Yu, and B. Du, "Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers," in CVPR, 2022, pp. 16846–16855.
[14] P. O. Pinheiro and R. Collobert, "From image-level to pixel-level labeling with convolutional networks," in CVPR, 2015, pp. 1713–1721.
[15] Z. Yang, Y. Meng, K. Fu, S. Wang, and Z. Song, "Tackling ambiguity from perspective of uncertainty inference and affinity diversification for weakly supervised semantic segmentation," arXiv preprint arXiv:2404.08195, 2024.
[16] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in CVPR, 2022, pp. 10684–10695.
[17] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in CVPR, 2016, pp. 2921–2929.
[18] J. Zhang, B. Peng, and X. Wu, "Dual graph inference network for weakly supervised semantic segmentation," IEEE Transactions on Circuits and Systems for Video Technology, 2025.
[19] J. Wang, T. Dai, X. Zhao, Á. F. García-Fernández, E. G. Lim, and J. Xiao, "Class activation map calibration for weakly supervised semantic segmentation," IEEE Transactions on Circuits and Systems for Video Technology, 2024.
[20] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in ICML. PMLR, 2021, pp. 8748–8763.
[21] X. Wang, S. You, X. Li, and H. Ma, "Weakly-supervised semantic segmentation by iteratively mining common object features," in CVPR, 2018, pp. 1354–1362.
[22] J. Xie, X. Hou, K. Ye, and L. Shen, "CLIMS: Cross language image matching for weakly supervised semantic segmentation," in CVPR, 2022, pp. 4483–4492.
[23] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in ICCV, 2017, pp. 618–626.
[24] B. Zhang, S. Yu, Y. Wei, Y. Zhao, and J. Xiao, "Frozen CLIP: A strong backbone for weakly supervised semantic segmentation," in CVPR, 2024, pp. 3796–3806.
[25] L. Zhu, X. Wang, J. Feng, T. Cheng, Y. Li, B. Jiang, D. Zhang, and J. Han, "WeakCLIP: Adapting CLIP for weakly-supervised semantic segmentation," International Journal of Computer Vision, pp. 1–21, 2024.
[26] E. Grave, M. M. Cisse, and A. Joulin, "Unbounded cache model for online language modeling with open vocabulary," NeurIPS, vol. 30, 2017.
[27] R. Zhang, W. Zhang, R. Fang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, "Tip-Adapter: Training-free adaption of CLIP for few-shot classification," 2022, pp. 493–510.
[28] D. Zhang, H. Li, W. Zeng, C. Fang, L. Cheng, M.-M. Cheng, and J. Han, "Weakly supervised semantic segmentation via alternate self-dual teaching," IEEE Transactions on Image Processing, 2023.
[29] T. Chen, Y. Yao, and J. Tang, "Multi-granularity denoising and bidirectional alignment for weakly supervised semantic segmentation," IEEE Transactions on Image Processing, vol. 32, pp. 2960–2971, 2023.
[30] Y. Du, Z. Fu, Q. Liu, and Y. Wang, "Weakly supervised semantic segmentation by pixel-to-prototype contrast," in CVPR, 2022, pp. 4320–4329.
[31] Z. Chen and Q. Sun, "Extracting class activation maps from non-discriminative features as well," in CVPR, 2023, pp. 3135–3144.
[32] K. Cheng, J. Tang, H. Gu, H. Wan, and M. Li, "Cross-block sparse class token contrast for weakly supervised semantic segmentation," IEEE Transactions on Circuits and Systems for Video Technology, 2024.
[33] X. Yin, W. Im, D. Min, Y. Huo, F. Pan, and S.-E. Yoon, "Fine-grained background representation for weakly supervised semantic segmentation," IEEE Transactions on Circuits and Systems for Video Technology, 2024.
[34] Y. Du, Z. Fu, and Q. Liu, "Pixel-level domain adaptation: A new perspective for enhancing weakly supervised semantic segmentation," IEEE Transactions on Image Processing, 2024.
[35] T. Chen, Y. Yao, X. Huang, Z. Li, L. Nie, and J. Tang, "Spatial structure constraints for weakly supervised semantic segmentation," IEEE Transactions on Image Processing, vol. 33, pp. 1136–1148, 2024.
[36] Z. Yang, K. Fu, M. Duan, L. Qu, S. Wang, and Z. Song, "Separate and conquer: Decoupling co-occurrence via decomposition and representation for weakly supervised semantic segmentation," in CVPR, 2024, pp. 3606–3615.
[37] L. Ru, H. Zheng, Y. Zhan, and B. Du, "Token contrast for weakly-supervised semantic segmentation," 2023, pp. 3093–3102.
[38] X. Li, T. Zhou, J. Li, Y. Zhou, and Z. Zhang, "Group-wise semantic mining for weakly supervised semantic segmentation," in AAAI, vol. 35, no. 3, 2021, pp. 1984–1992.
[39] F. Zhang, T. Zhou, B. Li, H. He, C. Ma, T. Zhang, J. Yao, Y. Zhang, and Y. Wang, "Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation," NeurIPS, vol. 36, pp. 73652–73665, 2023.
[40] Y. Lin, M. Chen, W. Wang, B. Wu, K. Li, B. Lin, H. Liu, and X. He, "CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation," 2023, pp. 15305–15314.
[41] S. Deng, W. Zhuo, J. Xie, and L. Shen, "Question-answer cross language image matching for weakly supervised semantic segmentation," arXiv preprint arXiv:2401.09883, 2024.
[42] B. Murugesan, R. Hussain, R. Bhattacharya, I. Ben Ayed, and J. Dolz, "Prompting classes: Exploring the power of prompt class learning in weakly supervised semantic segmentation," in WACV, 2024, pp. 291–302.
[43] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI. Springer, 2015, pp. 234–241.
[44] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, "Unleashing text-to-image diffusion models for visual perception," in ICCV, 2023, pp. 5729–5739.
[45] L. Karazija, I. Laina, A. Vedaldi, and C. Rupprecht, "Diffusion models for open-vocabulary segmentation," in ECCV. Springer, 2024, pp. 299–317.
[46] S.-H. Yoon, H. Kwon, J. Jeong, D. Park, and K.-J. Yoon, "Diffusion-guided weakly supervised semantic segmentation," in ECCV. Springer, 2024, pp. 393–411.
[47] R. Yoshihashi, Y. Otsuka, T. Tanaka, H. Kataoka et al., "Exploring limits of diffusion-synthetic training with weakly supervised semantic segmentation," in ACCV, 2024, pp. 2300–2318.
[48] W. Wu, T. Dai, X. Huang, F. Ma, and J. Xiao, "Image augmentation with controlled diffusion for weakly-supervised semantic segmentation," in ICASSP. IEEE, 2024, pp. 6175–6179.
[49] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[50] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," arXiv preprint arXiv:1609.07843, 2016.
[51] J. Wang, X. Li, J. Zhang, Q. Xu, Q. Zhou, Q. Yu, L. Sheng, and D. Xu, "Diffusion model is secretly a training-free open vocabulary semantic segmenter," IEEE Transactions on Image Processing, 2025.
[52] Z. Huan, Z. Pengzhou, and G. Zeyang, "K-means text dynamic clustering algorithm based on KL divergence," in ICIS. IEEE, 2018, pp. 659–663.
[53] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," in ICLR, 2022.
[54] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[55] X. Zhao, F. Tang, X. Wang, and J. Xiao, "SFC: Shared feature calibration in weakly supervised semantic segmentation," in AAAI, vol. 38, no. 7, 2024, pp. 7525–7533.
[56] S.-H. Yoon, H. Kwon, H. Kim, and K.-J. Yoon, "Class tokens infusion for weakly supervised semantic segmentation," in CVPR, 2024, pp. 3595–3605.
[57] F. Tang, Z. Xu, Z. Qu, W. Feng, X. Jiang, and Z. Ge, "Hunting attributes: Context prototype-aware learning for weakly supervised semantic segmentation," in CVPR, 2024, pp. 3324–3334.
[58] J. Wang, T. Dai, B. Zhang, S. Yu, E. G. Lim, and J. Xiao, "POT: Prototypical optimal transport for weakly supervised semantic segmentation," in CVPR, 2025, pp. 15055–15064.
[59] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," NeurIPS, vol. 24, 2011.
[60] P.-T. Jiang, Y. Yang, Q. Hou, and Y. Wei, "L2G: A simple local-to-global knowledge transfer framework for weakly supervised semantic segmentation," in CVPR, 2022, pp. 16886–16896.
[61] T. Zhou, M. Zhang, F. Zhao, and J. Li, "Regional semantic contrast and aggregation for weakly supervised semantic segmentation," in CVPR, 2022, pp. 4299–4309.
[62] C. Wang, R. Xu, S. Xu, W. Meng, and X. Zhang, "Treating pseudo-labels generation as image matting for weakly supervised semantic segmentation," in ICCV, 2023, pp. 755–765.
[63] J. Lee, S. J. Oh, S. Yun, J. Choe, E. Kim, and S. Yoon, "Weakly supervised semantic segmentation using out-of-distribution data," in CVPR, 2022, pp. 16897–16906.
[64] L. Chen, C. Lei, R. Li, S. Li, Z. Zhang, and L. Zhang, "FPR: False positive rectification for weakly supervised semantic segmentation," in ICCV, 2023, pp. 1108–1118.
[65] S. Rong, B. Tu, Z. Wang, and J. Li, "Boundary-enhanced co-training for weakly supervised semantic segmentation," in CVPR, 2023, pp. 19574–19584.
[66] L. Xu, M. Bennamoun, F. Boussaid, H. Laga, W. Ouyang, and D. Xu, "MCTformer+: Multi-class token transformer for weakly supervised semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[67] X. Zhao, Z. Yang, T. Dai, B. Zhang, and J. Xiao, "PSDPM: Prototype-based secondary discriminative pixels mining for weakly supervised semantic segmentation," in CVPR, 2024, pp. 3437–3446.
[68] Y. Wu, X. Ye, K. Yang, J. Li, and X. Li, "DuPL: Dual student with trustworthy progressive learning for robust weakly supervised semantic segmentation," in CVPR, 2024, pp. 3534–3543.
[69] Z. Yang, X. Zhao, X. Wang, Q. Zhang, and J. Xiao, "FFR: Frequency feature rectification for weakly supervised semantic segmentation," in CVPR, 2025, pp. 30261–30270.
[70] X. Xu, P. Zhang, W. Huang, Y. Shen, H. Chen, J. Lin, W. Li, G. He, J. Xie, and S. Lin, "Weakly supervised semantic segmentation via progressive confidence region expansion," in CVPR, 2025, pp. 9829–9838.
[71] S. Duan, X. Yang, and N. Wang, "Multi-label prototype visual spatial search for weakly supervised semantic segmentation," in CVPR, 2025, pp. 30241–30250.
[72] J. Hanna and D. Borth, "Know your attention maps: Class-specific token masking for weakly supervised semantic segmentation," arXiv preprint arXiv:2507.06848, 2025.
[73] Z. Yang, Y. Meng, K. Fu, S. Wang, and Z. Song, "MoRe: Class patch attention needs regularization for weakly supervised semantic segmentation," in AAAI, vol. 39, no. 9, 2025, pp. 9400–9408.
[74] S. Jang, J. Yun, J. Kwon, E. Lee, and Y. Kim, "DIAL: Dense image-text alignment for weakly supervised semantic segmentation," in ECCV. Springer, 2024, pp. 248–266.
[75] Z. Peng, G. Wang, L. Xie, D. Jiang, W. Shen, and Q. Tian, "USAGE: A unified seed area generation paradigm for weakly supervised semantic segmentation," in ICCV, 2023, pp. 624–634.
[76] Z. Cheng, P. Qiao, K. Li, S. Li, P. Wei, X. Ji, L. Yuan, C. Liu, and J. Chen, "Out-of-candidate rectification for weakly supervised semantic segmentation," in CVPR, 2023, pp. 23673–23684.
[77] Z. Chen, T. Wang, X. Wu, X.-S. Hua, H. Zhang, and Q. Sun, "Class re-activation maps for weakly-supervised semantic segmentation," in CVPR, 2022, pp. 969–978.
[78] S. Rossetti, D. Zappia, M. Sanzari, M. Schaerf, and F. Pirri, "Max pooling with vision transformers reconciles class and shape in weakly supervised semantic segmentation," in ECCV. Springer, 2022, pp. 446–463.
[79] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," IJCV, vol. 111, pp. 98–136, 2015.
[80] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV. Springer, 2014, pp. 740–755.

Showing first 80 references.