Learning transferable visual models from natural language supervision
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3
years
2026 3
representative citing papers
DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.
Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.
citing papers explorer
- Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks
  Boxes2Pixels distills noisy SAM pseudo-masks into a compact DINOv2-based student with auxiliary localization and one-sided self-correction, delivering +6.97 anomaly mIoU and +9.71 binary IoU gains over baselines on wind turbine data with 80% fewer parameters.
- DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement
  DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.
- Do Vision Language Models Need to Process Image Tokens?
  Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.