archive
Every paper Pith has read. Search by title, abstract, or pith.
5081 papers in cs.CV · page 1
-
Memory bank preserves characters across 48-shot gaps in video
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
-
One token unifies agentic and latent visual reasoning
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
-
RefDecoder enhances visual generation with conditional video decoding
RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
-
New index catches 3D geometry errors in video generators
Quantitative Video World Model Evaluation for Geometric-Consistency
-
Frozen video models follow camera paths via simple warp interface
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
-
Reward-driven planner and orchestrator improve multi-step image edits
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
-
Geometry-first method cuts satellite-to-street 3D error by 23 percent
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image
-
MicroscopyMatching targets ready-to-use microscopy image analysis in diverse conditions
MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions
-
Meschers process impossible objects without cuts or bends
Meschers: Geometry Processing of Impossible Objects
-
Head ranking doubles KV cache compression in image generators
HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
-
The paper presents the Closed-Loop Visual Reasoning (CLVR) framework that integrates…
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
-
Shared channel basis across frequencies boosts spectral mixers
CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators
-
Model reads cell types and protein levels from label-free images
Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning
-
Vision features align LLM text with clinical data for stroke prognosis
Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke
-
Adaptive mode switching raises fidelity on complex image prompts
Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
-
Dual-branch model copies text styles across languages in scenes
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
-
Model generates sign language replies from signing context alone
Towards Continuous Sign Language Conversation from Isolated Signs
-
VLMs fail to locate hidden functional objects from task instructions
SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
-
Generative model turns SDR video into HDR by predicting bracketed exposures
Generating HDR Video from SDR Video
-
Driving model gains planning edge by forecasting 3D futures
EponaV2: Driving World Model with Comprehensive Future Reasoning
-
Randomly initialized nets match active learning without candidate models
Are Candidate Models Really Needed for Active Learning?
-
Multiscale VLM features raise video edit quality
MiVE: Multiscale Vision-language features for reference-guided video Editing
-
Anatomy topology across patients boosts medical scan pre-training
Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging
-
New dataset tracks urban land and vegetation shifts with 5221 Sentinel-2 pairs
TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection
-
Vision framework with physical priors lifts water level accuracy
Vision-Based Water Level and Flow Estimation
-
RefineCAM improves high-resolution CAMs for CNN explanations
How to Evaluate and Refine your CAM
-
Multi-label benchmark shows MLLMs still miss full emotion mixes
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
-
Learned potential reweights bridges to improve generative fidelity
Action-Inspired Generative Models
-
Unified diffusion generates aligned VIS-IR-Label triplets from few pairs
UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation
-
The paper introduces SIRA, an internal contrastive decoding method that reduces…
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
-
ViMU benchmark tests video AI on hidden meanings
ViMU: Benchmarking Video Metaphorical Understanding
-
Hybrid Mamba-attention model extends rainfall forecasts to three hours
MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting
-
Gaussians replace grids to lift panoramic images into 3D detections
Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach
-
Two-stage model fuses radar and satellite for sharper rain forecasts
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
-
TOPOS locks single-image 3D heads to fixed studio topology
TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation
-
Higher-order stain stats raise federated pathology accuracy 3.9%
FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology
-
Aggregated vectors make different financial docs look identical
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
-
Dispersive loss on batch features sharpens medical boundaries
Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation
-
Framework turns fMRI signals into videos via semantic stages
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
-
Latent alignment of images to masks improves medical segmentation
SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation
-
Agent pipeline builds 100k layered wild images for accurate decomposition
LiWi: Layering in the Wild
-
2D convolutions extract temporal gait patterns via strip pooling
Local Spatiotemporal Convolutional Network for Robust Gait Recognition
-
RC metrics align object removal scores with human perception
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
-
Mask drift triggers repetition in diffusion vision-language models
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
-
The paper proposes using sparse images from different camera views captured at different…
From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper
-
ArcGate activation adapts shape to raise remote sensing accuracy
ArcGate: Adaptive Arctangent Gated Activation
-
Head-wise sparsity speeds video diffusion 1.93x
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
-
Training-free method stretches video generation to full minutes
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
-
GAN upsampling plus expert fusion cuts artifact bias in image detectors
Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection
-
GeoVista plans globally then inspects branches for satellite images
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding