super hub Canonical reference

Emogen: Emotional image content generation with text-to-image diffusion models

Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu · 2024 · arXiv 2733.2024

Canonical reference. 91% of citing Pith papers cite this work as background.

250 Pith papers citing it

citation-role summary

background 83 · dataset 6 · baseline 2 · method 2

citation-polarity summary

background 85 · use dataset 4 · baseline 2 · use method 2

representative citing papers

WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife

cs.CV · 2026-06-19 · unverdicted · novelty 8.0

WildBox provides over 237k 3D wildlife annotations from drone video and benchmarks reveal zero-shot 3D detection at 0 AP but fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.

ScaLe-INR: Scale and Learn Implicit Neural Representations

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

ScaLe-INR is a multi-branch INR architecture that applies directional scaling per the Fourier inverse theorem and a directional edge guidance loss to disentangle scales and improve reconstruction fidelity.

MATCH: Flow Matching for Multi-View Anomaly Detection

cs.CV · 2026-06-23 · unverdicted · novelty 7.0 · 2 refs

MATCH is the first flow matching method for multi-view anomaly detection, reporting SOTA results on Real-IAD and the first comprehensive evaluation on MANTA-Tiny while enabling real-time use by omitting the divergence term.

GeoFidelity-Bench: Evaluating Segment-Level Geographic Fidelity in Text-to-Image Street-View Generation

cs.CV · 2026-06-22 · unverdicted · novelty 7.0 · 4 refs

GeoFidelity-Bench shows text-to-image models gain city-level plausibility from local names but achieve near-zero improvement in exact segment identity, with GPS coordinates adding no benefit.

Arbor: Explicit Geometric Conditioning for Controllable 3D Asset Generation

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Arbor attaches constraint mesh tokens to a frozen text-to-3D denoiser to enable controllable generation obeying hull, avoidance, and touch constraints.

Leveraging target dynamics for imaging in complex media

physics.optics · 2026-06-21 · unverdicted · novelty 7.0

Target dynamics provide an intrinsic source of variation equivalent to controlled illumination changes, enabling scattering-compensated reconstruction of dynamic scenes with one acquisition per frame in holographic and fluorescence imaging.

4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

cs.CV · 2026-06-21 · conditional · novelty 7.0 · 2 refs

The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.

FLM-Occ: Feed-forward Likelihood Maximization for Efficient Indoor Occupancy Prediction

cs.CV · 2026-06-19 · unverdicted · novelty 7.0

FLM-Occ reformulates indoor occupancy prediction as feed-forward likelihood maximization over a mixture model with volume-normalized weights, achieving superior accuracy on Occ-ScanNet using only 32 superquadrics.

HERO: Hypothesis-Driven Evidence Retrieval from Omics for Multi-Task Breast Cancer Analysis

cs.CV · 2026-06-19 · unverdicted · novelty 7.0

HERO maps DNA methylation and miRNA to a 16-dimensional intent vector for TF-IDF caption retrieval and cosine-gated repair in VLM-based multi-task breast cancer prediction, claiming SOTA on TCGA-BRCA.

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

StylisticBias benchmark shows 15 visual attributes explain nearly 80% of bias variation in six MLLMs by isolating single cues like age and fashion in generated images.

Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

cs.CV · 2026-06-16 · conditional · novelty 7.0

CloudLULC-Net is an end-to-end heterogeneous SAR-optical fusion network for LULC mapping under cloud contamination that achieves 86.60% OA, 83.29% F1, and 73.51% mIoU on a new global benchmark of 40,223 samples.

TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

cs.CV · 2026-06-10 · unverdicted · novelty 7.0 · 2 refs

A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.

Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

FisherAdapTune uses temporal drift in Fisher geometry, measured by scale-invariant Jensen-Shannon distance, to progressively freeze stabilized parameter groups during fine-tuning, reporting gains on segmentation and zero-shot transfer.

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0 · 2 refs

An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.

Bridging CAD and Data-Driven Design: Attributed Feature Graphs for Engineering Design

cs.CE · 2026-06-04 · unverdicted · novelty 7.0 · 3 refs

Attributed Feature Graphs (AFGs) represent CAD features as attributed nodes and relations as directed edges to enable GNN surrogate models that predict design performance with feature-level interpretability on the CarHoods10K dataset.

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

cs.CV · 2026-06-04 · conditional · novelty 7.0

Empirical study of five LVR variants finds cosine alignment negatively correlates with accuracy (r=-0.94), supervised latents are bypassed under corruption (max 4-point shift), and answers are decodable downstream but not at the latent.

Multimarginal flow matching with optimal transport potentials

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

OTP-FM extends conditional flow matching by incorporating dynamic optimal transport potentials to enable efficient multimarginal transport learning with intermediate observed marginals.

TIDES: Time-Derivative Event Simulation via Deformable Reconstruction

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

TIDES simulates realistic event camera streams in continuous time via dynamic Gaussian splatting with adaptive occlusion handling and sensor artifact modeling, claiming SOTA fidelity and better downstream transfer than prior methods.

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

MERIT enables decentralized instruction tuning via conflict-aware PCA splitting and parameter-space merging, raising average benchmark scores above joint training on multimodal and text mixtures.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

SuperMemory-VQA provides 4,853 human-verified QA pairs from 52.9 hours of egocentric AI glasses recordings to benchmark AI systems on realistic long-horizon memory tasks including an unanswerable option.

From Noise to Control: Parameterized Diffusion Policies

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Parameterized Diffusion Policy learns a behavior manifold to condition diffusion policies on low-dimensional continuous parameters, enabling interpolation between strategies and adaptation to novel constraints without policy weight updates.

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video generation workflows using structured checkpoints and multi-agent evaluation.

The Abstraction Gap in Vision-Language Causal Reasoning

cs.CL · 2026-05-27 · unverdicted · novelty 7.0 · 2 refs

Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.

RS2AD-LiDAR: End-to-End Autonomous Driving LiDAR Data Generation from Roadside Sensor Observations

cs.CV · 2026-05-22 · unverdicted · novelty 7.0 · 2 refs

RS2AD-LiDAR reconstructs vehicle LiDAR data from roadside observations via coordinate transformation, virtual LiDAR modeling and resampling, claimed as the first such method, with experiments showing improved object detection when mixed with real data.

citing papers explorer

Showing 30 of 30 citing papers after filters.

Field-Localized Forgery Detection for Digital Identity Documents cs.CV · 2026-05-09 · unverdicted · none · ref 15 · 2 links
FLiD is a field-localized forgery detection method for identity documents that outperforms full-document baselines and general detectors with significantly fewer parameters.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 35
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
CSGuard: Toward Forgery-Resistant Watermarking in Diffusion Models via Compressed Sensing Constraint cs.CV · 2026-05-02 · unverdicted · none · ref 3 · 4 links
CSGuard binds diffusion-model watermarks to a secret matrix via compressed sensing, cutting forgery attack success from 100% to 28.12% while preserving 100% detection on legitimate images.
Towards Temporal Compositional Reasoning in Long-Form Sports Videos cs.CV · 2026-04-24 · unverdicted · none · ref 42
SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
HumanScore: Benchmarking Human Motions in Generated Videos cs.CV · 2026-04-22 · unverdicted · none · ref 23
HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.
Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data cs.CV · 2026-04-21 · unverdicted · none · ref 27 · 3 links
DHCNet improves ultra-fine-grained visual categorization by progressively building holistic cognition from local discrepancies using self-shuffling and refinement on limited data.
Efficient Video Diffusion Models: Advancements and Challenges cs.CV · 2026-04-17 · unverdicted · none · ref 57 · 2 links
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions cs.CV · 2026-04-07 · unverdicted · none · ref 18 · 2 links
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
A global dataset of continuous urban dashcam driving cs.CV · 2026-04-01 · accept · none · ref 45
CROWD is a new global dataset of 51,753 continuous urban dashcam segments spanning over 20,000 hours from 238 countries, with manual labels and automated object detections for routine driving analysis.
PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives cs.CV · 2026-05-11 · unverdicted · none · ref 10 · 2 links
PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.
MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition cs.CV · 2026-05-10 · unverdicted · none · ref 3 · 2 links
MAG-VLAQ fuses multi-modal ground and aerial data via ODE-conditioned vector-of-locally-aggregated-queries to nearly double recall@1 on aerial-ground place recognition benchmarks.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 16
HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with negligible hallucination.
Is Class Signal Clustered or Routed in Task-Induced Implicit Neural Representation Weight Spaces? cs.CV · 2026-05-08 · unverdicted · none · ref 23 · 2 links
Task-induced INR weights are classifiable because their class signal is routed through the reader rather than forming raw geometric clusters.
EAPFusion: Intrinsic Evolving Auxiliary Prior Guidance for Infrared and Visible Image Fusion cs.CV · 2026-05-03 · unverdicted · none · ref 23 · 2 links
EAPFusion uses self-evolving intrinsic priors to produce dynamic, scene-adaptive convolution kernels and channel-mixing fusion for infrared-visible images, reporting state-of-the-art results and downstream gains.
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring cs.CV · 2026-04-28 · conditional · none · ref 26 · 2 links
SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.
Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image cs.CV · 2026-04-15 · unverdicted · none · ref 31 · 4 links
Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running substantially faster.
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning cs.CV · 2026-04-09 · unverdicted · none · ref 44
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV · 2026-04-05 · conditional · none · ref 24
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning cs.CV · 2026-04-03 · unverdicted · none · ref 9 · 2 links
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition cs.CV · 2026-05-11 · unverdicted · none · ref 24
A fine-tuned large language-vision model achieves 98% accuracy on visual question answering for military vehicle identification in SAR imagery from an extended MSTAR benchmark.
VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement cs.CV · 2026-05-10 · unverdicted · none · ref 58
VFM-SDM enables accurate multi-directional structural displacement measurement from video using pre-trained vision models for camera estimation and point tracking, combined with geometry constraints, without task-specific training or preparation.
CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition cs.CV · 2026-05-01 · unverdicted · none · ref 6
CEZSAR uses contrastive learning to align video and sentence embeddings with automatic negative sampling, claiming state-of-the-art zero-shot action recognition on UCF-101 and Kinetics-400.
Hyperspectral Image Classification via Efficient Global Spectral Supertoken Clustering cs.CV · 2026-04-30 · unverdicted · none · ref 41
DSCC groups spectrally similar and spatially close pixels into supertokens using multi-criteria distance and soft labels, then classifies at the token level to achieve 0.728 CF1 at 197.75 FPS on WHU-OHS.
Weak-to-Strong Knowledge Distillation Accelerates Visual Learning cs.CV · 2026-04-16 · unverdicted · none · ref 11 · 4 links
Weak-to-strong knowledge distillation applied early and then turned off accelerates convergence to target performance in visual learning tasks by factors of 1.7-4.8x.
Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction cs.CV · 2026-04-03 · unverdicted · none · ref 40
A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.
Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges cs.CV · 2026-04-09 · unverdicted · none · ref 9 · 3 links
A survey that organizes methods for cross-domain object detection into a taxonomy, analyzes domain shift across detection stages, and outlines persistent challenges.
Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding cs.CV · 2025-08-28 · unverdicted · none · ref 126 · 4 links
A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.
Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions cs.CV · 2025-07-06 · unverdicted · none · ref 47
A literature review that categorizes deep learning approaches for visual hand gesture recognition, summarizes state-of-the-art methods across tasks, reviews datasets and metrics, and identifies challenges and future directions.
Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization cs.CV · 2026-05-12 · unreviewed · ref 3 · 5 links
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring cs.CV · 2026-05-08 · unreviewed · ref 9
