Tips: Text-image pretraining with spatial awareness.arXiv:2410.16512

Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, et al · 2024 · arXiv 2410.16512

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 use dataset 1

representative citing papers

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

cs.RO · 2026-01-31 · unverdicted · novelty 6.0

CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.

Perception Encoder: The best visual embeddings are not at the output of the network

cs.CV · 2025-04-17 · unverdicted · novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

cs.CV · 2026-05-21 · unverdicted · novelty 5.0 · 2 refs

FlexiCT provides CT foundation models via agglomerative pretraining on 266227 volumes from 56 datasets that match or exceed task-specific models on five task families while organizing embeddings along tumor-stage gradients.

Multilingual Vision-Language Models, A Survey

cs.CL · 2025-09-26 · accept · novelty 3.0

The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

citing papers explorer

Showing 5 of 5 citing papers.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register cs.CV · 2026-05-19 · unverdicted · none · ref 19
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining cs.RO · 2026-01-31 · unverdicted · none · ref 34
CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.
Perception Encoder: The best visual embeddings are not at the output of the network cs.CV · 2025-04-17 · unverdicted · none · ref 90
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining cs.CV · 2026-05-21 · unverdicted · none · ref 9 · 2 links
FlexiCT provides CT foundation models via agglomerative pretraining on 266227 volumes from 56 datasets that match or exceed task-specific models on five task families while organizing embeddings along tumor-stage gradients.
Multilingual Vision-Language Models, A Survey cs.CL · 2025-09-26 · accept · none · ref 95
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

Tips: Text-image pretraining with spatial awareness.arXiv:2410.16512

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer