Learning Transferable Visual Models From Natural Language Supervision

Aditya Ramesh, Alec Radford, Chris Hallacy, Gabriel Goh, Jong Wook Kim, Sandhini Agarwal · 2021 · cs.CV · arXiv 2103.00020

Mixed citation behavior. The most common role is background (69% of classified citations).

Cited by 265 Pith papers.

abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
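
The abstract names two mechanisms without spelling them out: a pre-training task that predicts which caption goes with which image, and zero-shot transfer by referencing visual concepts in natural language. The sketch below illustrates one common way to express both, assuming a symmetric cross-entropy over temperature-scaled cosine similarities and nearest-caption classification. The function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration, not the released implementation; real encoders would produce the embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Return an (images x texts) matrix of cosine similarities divided by temperature.

    The temperature value is an assumption for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy(logits_row, target):
    """Numerically stable softmax cross-entropy for one row of logits."""
    m = max(logits_row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum - logits_row[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair k in the batch is treated as the only correct match for image k and text k.
    """
    logits = similarity_logits(image_embs, text_embs, temperature)
    n = len(logits)
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    columns = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    loss_texts = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (loss_images + loss_texts) / 2


def zero_shot_classify(image_emb, class_text_embs):
    """Pick the index of the class caption most similar to the image embedding."""
    logits = similarity_logits([image_emb], class_text_embs)[0]
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(images, matched_texts))
    print(symmetric_contrastive_loss(images, swapped_texts))
    print(zero_shot_classify([0.9, 0.1], matched_texts))
```

Correctly paired batches should score a lower loss than mismatched ones, and zero-shot classification needs no extra training because the class set is supplied only as text embeddings at inference time.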

citation-role summary

background 36 · method 8 · baseline 4 · other 1

citation-polarity summary

background 34 · use method 8 · baseline 4 · unclear 2 · support 1

authors

Aditya Ramesh, Alec Radford, Chris Hallacy, Gabriel Goh, Jong Wook Kim, Sandhini Agarwal

representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.

Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

PuMVR benchmark shows VLMs exhibit script-dependent bias on Punjabi tasks with accuracy gaps up to 16% and script consistency rates as low as 24.8%, even when visual input is provided.

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Introduces Forged Calamity benchmark and shows that fine-tuned and zero-shot synthetic image detectors lose substantial accuracy on unseen generators and disaster types.

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

cs.CV · 2026-06-15 · conditional · novelty 7.0

A new benchmark for Punjabi reveals VLMs have large script-dependent performance gaps on identical tasks, with consistency as low as 24.8 percent.

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

VISA improves closed-set 3D occupancy mIoU on nuScenes by using VLM instance audits as reliability-weighted semantic supervisors during training of existing world models.

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Net-Ev² proposes a two-stage generative simulator with structure-guided masked pre-training and topology-aware diffusion using graph U-Net down/upsampling to model network event evolution from text inputs, plus a new 6.5M multimodal benchmark and JL-MMD metric.

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

cs.LG · 2026-06-09 · accept · novelty 7.0

A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

cs.DB · 2026-06-08 · unverdicted · novelty 7.0

ArtiFact is a new multi-modal dataset of 651k museum records used to benchmark cross-modal error detection with seven error categories and semantic query processing challenges.

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Dex2HOI is a dual-stream diffusion model with bidirectional cross-attention and motion fusion that generates long bimanual single- and two-object HOI sequences from text at real-time speeds.

citing papers explorer

Showing 50 of 265 citing papers.

Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment cs.CV · 2026-05-18 · unverdicted · none · ref 8 · internal anchor
A multimodal alignment pipeline decodes EEG signals recorded during natural image viewing into image retrieval (86.3% Top-1) and reconstruction (CLIP 0.903) tasks.
Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC cs.CV · 2026-05-16 · unverdicted · none · ref 19 · internal anchor
A pipeline that converts body-worn camera footage into labeled visual timelines by classifying 10-second windows along operational-context and motion-intensity axes using CLIP and optical-flow features.
Rethinking the Good Enough Embedding for Easy Few-Shot Learning cs.CV · 2026-05-13 · conditional · none · ref 24 · internal anchor
Frozen DINOv2-L features with k-NN classification and PCA/ICA refinement achieve state-of-the-art few-shot performance on four benchmarks without any backpropagation or fine-tuning.
Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition cs.CV · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
A fine-tuned large language-vision model achieves 98% accuracy on visual question answering for military vehicle identification in SAR imagery from an extended MSTAR benchmark.
Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning cs.LG · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
Finite-sample noise collapses the eigengap in representation covariances limiting recoverable modes K(N); multimodal learning stabilizes it via low-rank constraints, yielding better class separation quantified by truncated Mahalanobis energy approximated with a zeta function.
GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs cs.CV · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property.
Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response cs.CV · 2026-05-06 · unverdicted · none · ref 2 · 3 links · internal anchor
GeoQuery is a zero-shot retrieval system that optimizes text prompts on a proxy subset so language embeddings correlate with frozen CLAY visual embeddings, then performs text search followed by visual nearest-neighbor lookup, reaching 31.6% accuracy within 50 km on 76 disaster queries.
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers cs.RO · 2026-05-05 · unverdicted · none · ref 4 · internal anchor
FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.
Bolek: A Multimodal Language Model for Molecular Reasoning cs.LG · 2026-05-04 · unverdicted · none · ref 21 · internal anchor
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language cs.AI · 2026-04-22 · unverdicted · none · ref 10 · internal anchor
InVitroVision, a fine-tuned PaliGemma-2 model, generates natural language descriptions of embryo development and outperforms ChatGPT 5.2 and base models on a public time-lapse dataset.
Style-Based Neural Architectures for Real-Time Weather Classification cs.CV · 2026-04-20 · unverdicted · none · ref 20 · internal anchor
Three style-based neural architectures are proposed for real-time weather classification from images, with two truncated ResNet variants claimed to outperform prior methods and generalize across public datasets.
UniMesh: Unifying 3D Mesh Understanding and Generation cs.CV · 2026-04-19 · unverdicted · none · ref 38 · internal anchor
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention cs.AI · 2026-04-14 · unverdicted · none · ref 1 · internal anchor
A biologically inspired AI model of the visual hierarchy generates contour sketches that structurally resemble ancient pictographs from Egyptian, Chinese, and other writing systems.
Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection cs.CV · 2026-04-10 · unverdicted · none · ref 19 · internal anchor
Optimized prompts for vision foundation models improve cowpea detection accuracy by over 0.35 mAP on synthetic data and transfer effectively to real fields without manual annotations.
Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift cs.CV · 2026-04-10 · unverdicted · none · ref 6 · internal anchor
Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach cs.CE · 2026-04-08 · unverdicted · none · ref 67 · internal anchor
A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.
Woosh: A Sound Effects Foundation Model cs.SD · 2026-04-02 · accept · none · ref 28 · internal anchor
Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
Perceptual misalignment of texture representations in convolutional neural networks cs.CV · 2026-04-01 · unverdicted · none · ref 68 · internal anchor
No correlation exists between CNNs' Brain-Score alignment with the visual system and the perceptual content of their Gram-matrix texture representations.
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity cs.AI · 2026-03-03 · unverdicted · none · ref 72 · internal anchor
Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection cs.CV · 2026-02-10 · unverdicted · none · ref 14 · internal anchor
HLGFA detects anomalies by identifying breakdowns in cross-resolution feature consistency between high- and low-resolution views of normal samples, guided by structure and detail priors, and reports 97.9% pixel AUROC on MVTec AD.
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models cs.CV · 2026-01-29 · unverdicted · none · ref 71 · internal anchor
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification cs.CL · 2025-12-08 · unverdicted · none · ref 29 · internal anchor
Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
Optical Context Compression Is Just (Bad) Autoencoding cs.CV · 2025-12-03 · accept · none · ref 12 · internal anchor
Vision-based optical context compression performs no better than direct autoencoding baselines like mean pooling or hierarchical encoders across compression ratios.
Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning cs.LG · 2025-06-11 · unverdicted · none · ref 43 · internal anchor
BYOL-γ uses self-predictive representations to approximate successor representations, improving zero-shot combinatorial generalization in goal-conditioned behavioral cloning.
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration cs.CV · 2025-05-27 · unverdicted · none · ref 14 · internal anchor
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography cs.CV · 2025-02-04 · unverdicted · none · ref 3 · internal anchor
A 3D self-supervised foundation model trained on over 360k head CT scans improves downstream disease classification on limited-label internal and external datasets versus scratch-trained and prior models.
DetailCLIP: Injecting Image Details into CLIP's Feature Space cs.CV · 2022-08-31 · unverdicted · none · ref 22 · internal anchor
A patch-based fusion method extends CLIP to high-resolution images by retaining multi-scale details for improved class-prompted retrieval.
At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization cs.LG · 2026-06-24 · unverdicted · none · ref 26 · internal anchor
Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.
A Benchmark of (MRI-) Foundation Models to Predict IDH Mutational Status in Glioma eess.IV · 2026-06-22 · unverdicted · none · ref 22 · 2 links · internal anchor
TabPFN on radiomic features matched or outperformed image foundation models for IDH mutational status prediction in glioma MRI, with BiomedCLIP strongest among visual encoders and performance sensitive to cohort shifts and calibration.
Fail-RAG : A Retrieval Augmented Generation Informed Framework for Robot Failure Identification cs.RO · 2026-06-17 · unverdicted · none · ref 18 · internal anchor
Fail-RAG is a retrieval-augmented generation framework that detects and describes robot failures in warehouse tasks by querying an embedded failure database and applying VLMs, showing 25 percentage point higher accuracy than off-the-shelf VLMs.
EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation cs.RO · 2026-06-17 · unverdicted · none · ref 17 · internal anchor
EffiNav combines depth and vision-language inputs for efficient object goal navigation, matching or exceeding baselines on success rate and path-length-weighted success across simulation benchmarks and real-robot tests.
Multimodal Image Colorization: Quantifying the Impact of Text-Conditioned Guidance on Grayscale-to-Color Translation cs.GR · 2026-06-16 · unverdicted · none · ref 11 · internal anchor
Text conditioning improves PSNR by ~5.7%, SSIM by ~1.4%, colorfulness by up to 36.6%, and reduces LPIPS by ~9.5% across U-Net and Stable Diffusion colorization models.
Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing cs.CV · 2026-06-11 · unverdicted · none · ref 8 · internal anchor
Custom ZeroCLIP uses retrieval from seen provinces to caption traditional Indonesian clothing images from 8 unseen provinces, achieving CLIPScore 0.8536, BLEU-4 0.3342, and METEOR 0.4859 while outperforming baselines.
Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation cs.CV · 2026-06-05 · unverdicted · none · ref 44 · internal anchor
Early DC component convergence in text-to-image Transformer features causes output homogeneity; selective early attenuation via DAVE improves diversity without retraining or extra cost.
Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient cs.RO · 2026-05-26 · unverdicted · none · ref 45 · internal anchor
SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robotics benchmarks and sim-to-real results.
A Gaia-linked High-purity QSO Candidate Catalog in Selected Fields with Extinction-binned Calibration and Spectrum-informed Training astro-ph.IM · 2026-05-22 · unverdicted · none · ref 51 · internal anchor
The P3 selector achieves 0.9809 purity and 0.8869 completeness for QSO candidates in selected fields, outperforming Gaia's official probabilities.
Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation q-bio.BM · 2026-05-12 · unverdicted · none · ref 39 · internal anchor
Moirain models use multimodal SFT and DPO to generate novel RNA sequences with superior protein binding affinities in a zero-shot conditional setting.
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception cs.RO · 2026-05-11 · unverdicted · none · ref 54 · 2 links · internal anchor
StereoPolicy fuses left-right image features via cross-attention to deliver consistent gains over RGB, RGB-D, point cloud, and multi-view baselines in simulation and real-robot manipulation tasks.
LLM-Enhanced Topical Trend Detection at Snapchat cs.IR · 2026-04-29 · unverdicted · none · ref 17 · internal anchor
Snapchat's deployed system detects emerging topical trends in short videos via multimodal extraction, time-series burst detection, and LLM consolidation, achieving high precision over six months of human evaluation and improving content freshness in production.
ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification cs.LG · 2026-04-20 · unverdicted · none · ref 9 · internal anchor
ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR.
Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration cs.IR · 2026-04-19 · unverdicted · none · ref 23 · internal anchor
A multi-agent multimodal system with fact-grounded adjudication and a dynamic two-tier preference graph cuts false positives in content filtering by 74.3% and nearly doubles F1-score versus text-only baselines while supporting user-driven Delta adjustments.
Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection cs.CV · 2026-04-18 · unverdicted · none · ref 1 · internal anchor
Vision-language grounding shows high prompt sensitivity, with different wordings for the same object leading to distinct instance selections and text embeddings explaining only 34% of the disagreement.
SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning cs.CV · 2026-04-15 · unverdicted · none · ref 12 · internal anchor
SatBLIP fine-tunes a satellite-adapted BLIP model on GPT-4o-generated captions to predict county-level SVI from satellite tiles and uses SHAP to highlight key features like roof condition and vegetation.
Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages cs.SD · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
CLAP audio representations with few-shot contrastive adaptation achieve competitive abusive speech detection across ten Indic languages, though gains are language-dependent and not always larger with more examples.
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems cs.MM · 2026-04-07 · unverdicted · none · ref 32 · internal anchor
DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification cs.CV · 2026-04-06 · unverdicted · none · ref 37 · internal anchor
Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP cs.LG · 2026-04-01 · unverdicted · none · ref 13 · internal anchor
Matched learning-rate experiments show LoRA retains substantially higher zero-shot transfer (45% vs 11% on EuroSAT, 58% vs 9% on Pets) than Full FT in CLIP adaptation.
Are vision-language models ready to zero-shot replace supervised classification models in agriculture? cs.CV · 2025-12-17 · unverdicted · none · ref 2 · internal anchor
Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images cs.CV · 2025-11-19 · unverdicted · none · ref 1 · internal anchor
PCMDE is a three-stage metric that extracts multimodal features, fuses components with confidence weights, and applies LLM-based physics-guided reasoning to assess synthetic image quality beyond standard scores like BLEU or CLIPScore.
CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images cs.CV · 2025-06-13 · unverdicted · none · ref 13 · internal anchor
A lightweight multi-modal CLIP pipeline predicts exact-match geographical tags on a Kaggle subset of the Geograph crowdsourced image archive by fusing image, location, and title embeddings.
