archive

Every paper Pith has read.

5081 papers in cs.CV · page 1

cs.CV 2026-05-14 reviewed

Memory bank preserves characters across 48-shot gaps in video
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Meng Wei +3
cs.CV 2026-05-14 reviewed

One token unifies agentic and latent visual reasoning
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Pheng-Ann Heng +3
cs.CV 2026-05-14 reviewed

The paper proposes RefDecoder
RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Bohan Fang +4
cs.CV 2026-05-14 reviewed

New index catches 3D geometry errors in video generators
Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu +4
cs.CV 2026-05-14 reviewed

Frozen video models follow camera paths via simple warp interface
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Tong He +1
cs.CV 2026-05-14 reviewed

Reward-driven planner and orchestrator improve multi-step image edits
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Anirudh Sundara Rajan +2
cs.CV 2026-05-14 reviewed

Geometry-first method cuts satellite-to-street 3D error by 23 percent
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Bin Tan +8
cs.CV 2026-05-14 reviewed

The paper introduces MicroscopyMatching
MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

Haoxuan Qu +5
cs.GR 2026-05-14 reviewed

Meschers process impossible objects without cuts or bends
Meschers: Geometry Processing of Impossible Objects

Ana Dodik +6
cs.CV 2026-05-14 reviewed

Head ranking doubles KV cache compression in image generators
HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Axel Berg +4
cs.CV 2026-05-14 reviewed

The paper presents the Closed-Loop Visual Reasoning (CLVR) framework that integrates…
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hanbo Cheng +4
cs.CV 2026-05-14 reviewed

Shared channel basis across frequencies boosts spectral mixers
CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators

Hongli Chen +5
cs.CV 2026-05-14 reviewed

Model reads cell types and protein levels from label-free images
Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

Ardhendu Behera +1
cs.CV 2026-05-14 reviewed

Vision features align LLM text with clinical data for stroke prognosis
Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

Guanjie Wang +7
cs.CV 2026-05-14 reviewed

Adaptive mode switching raises fidelity on complex image prompts
Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

Bingjie Gao +11
cs.CV 2026-05-14 reviewed

Dual-branch model copies text styles across languages in scenes
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

Fangmin Zhao +5
cs.CV 2026-05-14 reviewed

Model generates sign language replies from signing context alone
Towards Continuous Sign Language Conversation from Isolated Signs

Chanyoung Kim +6
cs.CV 2026-05-14 reviewed

VLMs fail to locate hidden functional objects from task instructions
SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Gueter Josmy Faure +4
cs.CV 2026-05-14 reviewed

Generative model turns SDR video into HDR by predicting bracketed exposures
Generating HDR Video from SDR Video

Daisuke Iso +8
cs.CV 2026-05-14 reviewed

Driving model gains planning edge by forecasting 3D futures
EponaV2: Driving World Model with Comprehensive Future Reasoning

Jian Yang +10
cs.CV 2026-05-14 reviewed

Randomly initialized nets match active learning without candidate models
Are Candidate Models Really Needed for Active Learning?

Harshini Mridula Mohan +4
cs.CV 2026-05-14 reviewed

Multiscale VLM features raise video edit quality
MiVE: Multiscale Vision-language features for reference-guided video Editing

Chengjing Wu +6
cs.CV 2026-05-14 reviewed

Anatomy topology across patients boosts medical scan pre-training
Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

Chen Jiang +10
cs.CV 2026-05-14 reviewed

New dataset tracks urban land and vegetation shifts with 5221 Sentinel-2 pairs
TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

Omkar Oak +3
cs.CV 2026-05-14 reviewed

Vision framework with physical priors lifts water level accuracy
Vision-Based Water Level and Flow Estimation

ZhiXin Sun
cs.CV 2026-05-14 reviewed

RefineCAM improves high-resolution CAMs for CNN explanations
How to Evaluate and Refine your CAM

Alessandra Stramiglio +3
cs.CV 2026-05-14 reviewed

Multi-label benchmark shows MLLMs still miss full emotion mixes
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

Mo Fan +5
cs.LG 2026-05-14 reviewed

Learned potential reweights bridges to improve generative fidelity
Action-Inspired Generative Models

Debnath Pal +1
cs.CV 2026-05-14 reviewed

Unified diffusion generates aligned VIS-IR-Label triplets from few pairs
UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

Chen Ding +6
cs.CV 2026-05-14 reviewed

The paper introduces SIRA, an internal contrastive decoding method that reduces…
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Junzhe Chen +5
cs.CV 2026-05-14 reviewed

ViMU benchmark tests video AI on hidden meanings
ViMU: Benchmarking Video Metaphorical Understanding

Qi Li +1
cs.CV 2026-05-14 reviewed

Hybrid Mamba-attention model extends rainfall forecasts to three hours
MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting

Boyu Liu +12
cs.CV 2026-05-14 reviewed

Gaussians replace grids to lift panoramic images into 3D detections
Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach

Kanglin Ning +5
cs.CV 2026-05-14 reviewed

Two-stage model fuses radar and satellite for sharper rain forecasts
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

Boyu Liu +8
cs.CV 2026-05-14 reviewed

TOPOS locks single-image 3D heads to fixed studio topology
TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

Bojun Xiong +8
cs.CV 2026-05-14 reviewed

Higher-order stain stats raise federated pathology accuracy 3.9%
FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology

Fengyi Zhang +2
cs.CV 2026-05-14 reviewed

Aggregated vectors make different financial docs look identical
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Ho Hung Lim +1
cs.CV 2026-05-14 reviewed

Dispersive loss on batch features sharpens medical boundaries
Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

Guowei Zou +3
cs.CV 2026-05-14 reviewed

Framework turns fMRI signals into videos via semantic stages
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

Biao Gong +8
cs.CV 2026-05-14 reviewed

Latent alignment of images to masks improves medical segmentation
SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation

Guowei Zou +3
cs.CV 2026-05-14 reviewed

Agent pipeline builds 100k layered wild images for accurate decomposition
LiWi: Layering in the Wild

Dong Chen +9
cs.CV 2026-05-14 reviewed

2D convolutions extract temporal gait patterns via strip pooling
Local Spatiotemporal Convolutional Network for Robust Gait Recognition

Cunrong Li +2
cs.CV 2026-05-14 reviewed

RC metrics align object removal scores with human perception
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

Daiguo Zhou +8
cs.CV 2026-05-14 reviewed

Mask drift triggers repetition in diffusion vision-language models
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Chanyong Yoon +2
cs.CV 2026-05-14 reviewed

The paper proposes using sparse images from different camera views captured at different…
From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

Changjie Chen +6
cs.CV 2026-05-14 reviewed

ArcGate activation adapts shape to raise remote sensing accuracy
ArcGate: Adaptive Arctangent Gated Activation

Alejandro C. Frery +4
cs.CV 2026-05-14 reviewed

Head-wise sparsity speeds video diffusion 1.93x
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Fei Chao +5
cs.CV 2026-05-14 reviewed

Training-free method stretches video generation to full minutes
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

Chi Zhang +3
cs.CV 2026-05-14 reviewed

GAN upsampling plus expert fusion cuts artifact bias in image detectors
Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

Gao Li +5
cs.CV 2026-05-14 reviewed

GeoVista plans globally then inspects branches for satellite images
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Bo Yang +12