Learning Transferable Visual Models From Natural Language Supervision

Aditya Ramesh, Alec Radford, Chris Hallacy, Gabriel Goh, Jong Wook Kim, Sandhini Agarwal · 2021 · cs.CV · arXiv 2103.00020

Mixed citation behavior. Most common role is background (69%).

265 Pith papers citing it

abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
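
The pre-training task described in the abstract, predicting which caption goes with which image, can be illustrated with a minimal sketch. It assumes a symmetric cross-entropy over cosine similarities within a batch of matched (image, text) embeddings, and it reuses the same similarity score for zero-shot classification against text embeddings of class descriptions. The function names, the fixed temperature value, and the use of plain NumPy arrays in place of real encoders are all illustrative assumptions, not the released implementation.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch where row i of image_emb
    is paired with row i of text_emb. The temperature is a fixed,
    illustrative value here (an assumption of this sketch)."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    # Each image should pick its own caption (softmax over columns) ...
    loss_images = -log_softmax(logits, axis=1)[idx, idx].mean()
    # ... and each caption should pick its own image (softmax over rows).
    loss_texts = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_images + loss_texts) / 2


def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    similarity = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return similarity.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))  # stand-in for encoder outputs
    print("aligned loss :", contrastive_loss(emb, emb))
    print("shuffled loss:", contrastive_loss(emb, np.roll(emb, 1, axis=0)))
    print("zero-shot    :", zero_shot_predict(emb, emb))
```

In this toy setup the loss is lower when the image and text rows are correctly paired than when the pairing is shifted, which is the signal the pre-training objective exploits.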

citation-role summary

background 36 · method 8 · baseline 4 · other 1

citation-polarity summary

background 34 · use method 8 · baseline 4 · unclear 2 · support 1

representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.

Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

PuMVR benchmark shows VLMs exhibit script-dependent bias on Punjabi tasks with accuracy gaps up to 16% and script consistency rates as low as 24.8%, even when visual input is provided.

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Introduces Forged Calamity benchmark and shows that fine-tuned and zero-shot synthetic image detectors lose substantial accuracy on unseen generators and disaster types.

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

cs.CV · 2026-06-15 · conditional · novelty 7.0

A new benchmark for Punjabi reveals VLMs have large script-dependent performance gaps on identical tasks, with consistency as low as 24.8 percent.

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

VISA improves closed-set 3D occupancy mIoU on nuScenes by using VLM instance audits as reliability-weighted semantic supervisors during training of existing world models.

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Net-Ev² proposes a two-stage generative simulator with structure-guided masked pre-training and topology-aware diffusion using graph U-Net down/upsampling to model network event evolution from text inputs, plus a new 6.5M multimodal benchmark and JL-MMD metric.

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

cs.LG · 2026-06-09 · accept · novelty 7.0

A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

cs.DB · 2026-06-08 · unverdicted · novelty 7.0

ArtiFact is a new multi-modal dataset of 651k museum records used to benchmark cross-modal error detection with seven error categories and semantic query processing challenges.

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Dex2HOI is a dual-stream diffusion model with bidirectional cross-attention and motion fusion that generates long bimanual single- and two-object HOI sequences from text at real-time speeds.

citing papers explorer

Showing 50 of 265 citing papers.

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder cs.CV · 2026-06-09 · unverdicted · none · ref 36 · internal anchor
IDEAL improves discrete representation autoencoders by jointly aligning quantized tokens with shallow and deep VFM features, reporting 0.61 rFID on ImageNet and 1.89 gFID for autoregressive image generation.
GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks cs.LG · 2026-06-06 · unverdicted · none · ref 47 · internal anchor
GeoGNN is a two-tower GNN that learns geographic cell embeddings from adjacency graphs and matches them to temporal representations via dot-product similarity plus classification, improving geolocalization accuracy by ~27% on electricity datasets.
FIGMA: Towards FIne-Grained Music retrievAl cs.SD · 2026-06-04 · unverdicted · none · ref 9 · internal anchor
FIGMA proposes a multi-view contrastive architecture plus the FGMCaps dataset to retrieve music from fine-grained textual descriptions of musical attributes, reporting up to 73.3% relative gains over CLAP baselines.
From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models cs.LG · 2026-06-03 · unverdicted · none · ref 23 · internal anchor
SLM adds a dedicated spatial modality and training dataset to LLMs, enabling geometric spatial reasoning and outperforming prompt-based symbolic methods on the new SpatialEval benchmark.
Beyond Compression: Quantifying Spectral Accessibility in Vision Representations cs.CV · 2026-06-02 · unverdicted · none · ref 11 · internal anchor
Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.
Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning cs.CV · 2026-06-02 · unverdicted · none · ref 94 · internal anchor
IN2R rectifies inter-modal noisy correspondence by synthesizing continuous soft prototypes from intra-modal neighbor consensus using a Graph Refiner on dynamic cross-modal memory.
MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding cs.LG · 2026-05-28 · unverdicted · none · ref 38 · internal anchor
MIRAGE uses adaptive multimodal gating on native multimodal backbones plus a transformer encoder to achieve state-of-the-art whole-brain fMRI prediction for naturalistic audiovisual stimuli, outperforming post-hoc unimodal aggregation.
IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents cs.CL · 2026-05-27 · unverdicted · none · ref 49 · internal anchor
IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.
Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images q-bio.NC · 2026-05-27 · unverdicted · none · ref 24 · internal anchor
Backpropagated gradients from vision models predict higher visual cortex signals but diverge from brain hierarchies in spatial and temporal organization.
LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment cs.CV · 2026-05-27 · unverdicted · none · ref 10 · internal anchor
LAST linearizes action manifolds with Lie-algebraic mapping and discretizes them into approximately isotropic charts to align with VL semantic geometry via Gromov-Wasserstein distance.
When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection cs.CV · 2026-05-26 · unverdicted · none · ref 26 · internal anchor
Social gaze consistency between interacting people is proposed as a new semantic cue orthogonal to low-level artifacts for detecting AI-generated images, with reported accuracy gains on vision and vision-language models.
R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction cs.CV · 2026-05-25 · unverdicted · none · ref 22 · internal anchor
R5DGS augments physics-driven 4D Gaussian splatting with identity encodings and centroid-only rigid-body dynamics to enable semantic open-vocabulary retrieval and 11 FPS faster extrapolation.
CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation cs.CV · 2026-05-25 · unverdicted · none · ref 32 · internal anchor
A multi-teacher distillation framework that packs 50 effect LoRAs and fast sampling into a single adapter while aiming to avoid concept interference.
Your Embedding Model is SMARTer Than You Think cs.IR · 2026-05-24 · unverdicted · none · ref 17 · internal anchor
SMART unlocks latent multi-vector capabilities in single-vector embedding models by applying late interaction to frozen hidden states shaped by contrastive training, yielding consistent gains on MMEB-V2 and visual document retrieval.
Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models cs.CV · 2026-05-24 · unverdicted · none · ref 12 · internal anchor
TCC calibrates cached representations in diffusion sampling via an offline iterative procedure that accounts for trajectory shifts, improving FID from 29.83 to 27.35 on PixArt-alpha while preserving reuse policies.
Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification cs.LG · 2026-05-23 · unverdicted · none · ref 15 · internal anchor
StenCE uses cross-modal contrastive learning on paired ECG-angiography data to learn ECG features that classify severe coronary stenosis, reporting the first high performance on this task.
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations cs.CV · 2026-05-21 · unverdicted · none · ref 26 · 3 links · internal anchor
Authors link memorization to internal instability in diffusion models via latent norms, propose step-wise detection and mitigation achieving AUC >0.999 and 0% memorization rate on Stable Diffusion 1.4.
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation cs.CV · 2026-05-20 · unverdicted · none · ref 21 · internal anchor
UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.
A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability cs.DC · 2026-05-18 · unverdicted · none · ref 43 · internal anchor
RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System cs.RO · 2026-05-18 · unverdicted · none · ref 45 · internal anchor
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation cs.CV · 2026-05-17 · unverdicted · none · ref 25 · 2 links · internal anchor
SegRAG is a training-free retrieval-augmented framework that extracts class-specific point prompts from a filtered DINOv3 feature bank to boost SAM3 semantic segmentation performance on standard and agricultural benchmarks.
How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning cs.RO · 2026-05-16 · unverdicted · none · ref 33 · internal anchor
DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.
Quantitative Video World Model Evaluation for Geometric-Consistency cs.CV · 2026-05-14 · unverdicted · none · ref 23 · internal anchor
PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization cs.CV · 2026-05-12 · unverdicted · none · ref 17 · internal anchor
A dual-branch system using frequency edge cues and CLIP-based synthetic patch detection for accurate, resolution-independent image forgery localization.
Language-Conditioned Visual Grounding with CLIP Multilingual cs.CL · 2026-05-09 · unverdicted · none · ref 1 · internal anchor
Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding cs.CV · 2026-05-08 · unverdicted · none · ref 34 · internal anchor
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
MRBTs are LLM-generated, SMT-verified behavior trees that supply modular reward functions and action masks, improving RL training efficiency and success rates on five compositional tasks over baselines.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems cs.AI · 2026-05-03 · unverdicted · none · ref 34 · 2 links · internal anchor
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models cs.CR · 2026-05-02 · conditional · none · ref 24 · internal anchor
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 104 · internal anchor
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models cs.CV · 2026-04-29 · unverdicted · none · ref 21 · internal anchor
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift cs.CV · 2026-04-27 · unverdicted · none · ref 16 · internal anchor
MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.
Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models q-bio.NC · 2026-04-23 · unverdicted · none · ref 45 · internal anchor
Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.
Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue cs.CL · 2026-04-22 · unverdicted · none · ref 40 · internal anchor
Incremental visual scaffolding using multimodal models improves persistent common ground representation in situated dialogue by reducing representational blur compared to text-only approaches, with hybrid text-visual yielding best results on the IndiRef benchmark.
REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction cs.CV · 2026-04-20 · unverdicted · none · ref 20 · internal anchor
REVEAL uses vision-language alignment of retinal morphometry and clinical risk narratives plus group contrastive learning to predict AD and dementia about 8 years early.
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios cs.CV · 2026-04-15 · unverdicted · none · ref 27 · internal anchor
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional loss plus geometric priors to preserve correct component relationships.
MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis cs.CV · 2026-04-15 · unverdicted · none · ref 27 · internal anchor
MaMe is a differentiable matrix-only token merging method that doubles ViT-B throughput with a 2% accuracy drop on pre-trained models and enables faster, higher-quality image synthesis when paired with MaRe.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 47 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models cs.AI · 2026-04-13 · unverdicted · none · ref 40 · 2 links · internal anchor
EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization cs.CV · 2026-04-13 · unverdicted · none · ref 43 · internal anchor
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations cs.RO · 2026-04-12 · unverdicted · none · ref 140 · internal anchor
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match teleoperation success rates on five tabletop tasks with 5-8x less collection effort.
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification cs.LG · 2026-04-09 · unverdicted · none · ref 22 · internal anchor
ADAPT is a new pre-training paradigm that aligns physical properties of time-series data to allow simultaneous training on 162 diverse classification datasets, achieving new state-of-the-art performance.
Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
LAKE identifies sparse anomaly-sensitive neurons in pre-trained VLMs using minimal normal samples to build compact normality representations and achieve SOTA anomaly detection with neuron-level interpretability.
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection cs.CV · 2026-04-03 · unverdicted · none · ref 15 · 2 links · internal anchor
DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis cs.CV · 2026-04-02 · unverdicted · none · ref 33 · internal anchor
A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems cs.LG · 2026-04-01 · conditional · none · ref 9 · internal anchor
A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.
The Gait Signature of Frailty: Transfer Learning based Deep Gait Models for Scalable Frailty Assessment cs.CV · 2026-03-25 · unverdicted · none · ref 34 · internal anchor
Transfer learning on a new clinical gait dataset shows selective freezing of low-level features in pretrained models yields stable frailty classification, with model attention aligning to lower-limb biomechanics.
Causal Attribution via Activation Patching cs.CV · 2026-03-13 · unverdicted · none · ref 27 · internal anchor
CAAP produces patch attributions in ViTs by direct activation patching on intermediate layers to measure causal contribution to the target class score.
TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization cs.CV · 2026-03-09 · unverdicted · none · ref 34 · internal anchor
TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.
TiledAttention: a CUDA Tile SDPA Kernel for PyTorch cs.LG · 2026-03-02 · unverdicted · none · ref 15 · internal anchor
TiledAttention is a cuTile-based SDPA kernel that balances performance with Python-level customizability for attention research in PyTorch.
