WildBox provides over 237k 3D wildlife annotations from drone video and benchmarks reveal zero-shot 3D detection at 0 AP but fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.
super hub Mixed citations
Derf: Decomposed radiance fields
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
authors
co-cited works
representative citing papers
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
TabPFN is a Prior-Data Fitted Network that approximates Bayesian inference for small tabular classification by training a Transformer once on synthetic data drawn from a causal prior, then solves new tasks in a single forward pass without further updates.
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.
SpikeTAD proposes the first SNN-based end-to-end TAD model, reporting 67.2% mAP on THUMOS14 and 37.42% on ActivityNet-1.3 with extremely low power consumption.
An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.
Reveal-IG performs path attribution by integrating model output changes along trajectories in a space of probe distributions rather than input-space paths, retaining completeness and handling multiscale or uncertain features.
A new quality-guided approach for semi-supervised medical image segmentation that trains a predictor on synthetic errors to enhance pseudolabel handling.
DARE-EEG is a self-supervised EEG foundation model that enforces mask-invariance via contrastive mask alignment and momentum anchor alignment, plus conv-linear-probing for heterogeneous setups, achieving SOTA accuracy and cross-dataset portability.
A rule-based strikingness measure is added to TKGR metrics to weight rare events higher, revealing that models weaken on striking events and ensemble gains come mostly from trivial fits.
Vector Scaffolding uses Interior Gradient Aggregation, Progressive Stratification, and Rapid Inflation Scheduling to achieve 2.5x faster optimization and up to 1.4 dB higher PSNR in differentiable vectorization.
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
Calibration error tracks curvature via shared margin-dependent exponential tails; a margin-aware objective improves out-of-sample calibration across optimizers.
DHCNet improves ultra-fine-grained visual categorization by progressively building holistic cognition from local discrepancies using self-shuffling and refinement on limited data.
SARR modifies trigonometric rotation encodings with object symmetry orders to produce unique continuous poses, enabling standard CNNs to outperform existing methods on symmetry-aware 6D pose estimation without custom losses or 3D models.
Orthogonal transformations before order reduction in matrix zonotopes produce order-of-magnitude smaller reachable set volumes while keeping generator counts comparable.
FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.
Text Encoded Extrusions (TEE) lets LLMs generate and edit manifold 3D meshes by learning sequences of face extrusions from decomposed quadrilateral meshes.
BadVSFM is the first effective backdoor attack on prompt-driven video segmentation foundation models, using a two-stage encoder-decoder strategy to achieve high attack success rates with limited clean performance loss.
A pose-conditioned large-margin contrastive encoder isolates persistent biometric identity cues from transmitted latents in talking-head videoconferencing to flag impersonation attacks via cosine similarity without inspecting the output video.
IMAGEO-Bench evaluates 10 LLMs on image geolocalization across global street scenes, US POIs, and private images, revealing closed-source model advantages and biases favoring high-resource regions.
citing papers explorer
-
WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife
WildBox provides over 237k 3D wildlife annotations from drone video and benchmarks reveal zero-shot 3D detection at 0 AP but fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
TabPFN is a Prior-Data Fitted Network that approximates Bayesian inference for small tabular classification by training a Transformer once on synthetic data drawn from a causal prior, then solves new tasks in a single forward pass without further updates.
-
4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
-
TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation
A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.
-
SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection
SpikeTAD proposes the first SNN-based end-to-end TAD model, reporting 67.2% mAP on THUMOS14 and 37.42% on ActivityNet-1.3 with extremely low power consumption.
-
Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation
An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.
-
Attribution via Distributional Paths for Information Revelation
Reveal-IG performs path attribution by integrating model output changes along trajectories in a space of probe distributions rather than input-space paths, retaining completeness and handling multiscale or uncertain features.
-
Quality-Guided Semi-Supervised Learning for Medical Image Segmentation
A new quality-guided approach for semi-supervised medical image segmentation that trains a predictor on synthetic errors to enhance pseudolabel handling.
-
DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG
DARE-EEG is a self-supervised EEG foundation model that enforces mask-invariance via contrastive mask alignment and momentum anchor alignment, plus conv-linear-probing for heterogeneous setups, achieving SOTA accuracy and cross-dataset portability.
-
Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning
A rule-based strikingness measure is added to TKGR metrics to weight rare events higher, revealing that models weaken on striking events and ensemble gains come mostly from trivial fits.
-
Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization
Vector Scaffolding uses Interior Gradient Aggregation, Progressive Stratification, and Rapid Inflation Scheduling to achieve 2.5x faster optimization and up to 1.4 dB higher PSNR in differentiable vectorization.
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On
A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
-
Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention
Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
-
Too Sharp, Too Sure: When Calibration Follows Curvature
Calibration error tracks curvature via shared margin-dependent exponential tails; a margin-aware objective improves out-of-sample calibration across optimizers.
-
Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data
DHCNet improves ultra-fine-grained visual categorization by progressively building holistic cognition from local discrepancies using self-shuffling and refinement on limited data.
-
Towards Symmetry-sensitive Pose Estimation: A Rotation Representation for Symmetric Object Classes
SARR modifies trigonometric rotation encodings with object symmetry orders to produce unique continuous poses, enabling standard CNNs to outperform existing methods on symmetry-aware 6D pose estimation without custom losses or 3D models.
-
Orthogonal Transformations for Efficient Data-Driven Reachability Analysis
Orthogonal transformations before order reduction in matrix zonotopes produce order-of-magnitude smaller reachable set volumes while keeping generator counts comparable.
-
FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation
FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.
-
Learning to Build Shapes by Extrusion
Text Encoded Extrusions (TEE) lets LLMs generate and edit manifold 3D meshes by learning sequences of face extrusions from decomposed quadrilateral meshes.
-
Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
BadVSFM is the first effective backdoor attack on prompt-driven video segmentation foundation models, using a two-stage encoder-decoder strategy to achieve high attack success rates with limited clean performance loss.
-
Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing
A pose-conditioned large-margin contrastive encoder isolates persistent biometric identity cues from transmitted latents in talking-head videoconferencing to flag impersonation attacks via cosine similarity without inspecting the output video.
-
From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models
IMAGEO-Bench evaluates 10 LLMs on image geolocalization across global street scenes, US POIs, and private images, revealing closed-source model advantages and biases favoring high-resource regions.
-
Cyclic 2.5D Perceptual Loss for Cross-Modal 3D Medical Image Synthesis: T1w MRI to Tau PET
Proposes a cyclic 2.5D perceptual loss with manufacturer SUVR standardization for T1w MRI to tau PET synthesis, reporting improved regional agreement on ADNI and SCAN cohorts across U-Net, UNETR, SwinUNETR, CycleGAN, and Pix2Pix.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
Estimation--Prediction Tradeoff in Causal Probabilistic Temporal Graphs
Characterizes an estimation-prediction tradeoff in binary logistic models for causal probabilistic temporal graphs and proposes a framework to jointly evaluate temporal link prediction with causal parameter recovery via Cramér-Rao bounds.
-
HandMade: Spatial Prompting for Generative 3D Creation with Part-Labeled VR Sketches
HandMade converts segmented VR strokes into multi-view part guidance and structured prompts so generative 3D models better preserve user-specified spatial scaffolds than text-only or sketch baselines.
-
Minkowski-Type Wasserstein Metrics and Barycenters for Location-Scale Mixtures with Application to Domain Adaptation
A Minkowski-type Wasserstein framework for location-scale mixtures reduces multimarginal OT to discrete component transport with linear complexity and shows competitive domain adaptation performance.
-
Venice-H1: Failure-Aware Query Re-Ranking with Multi-Scale Grid Signatures for Referring Image Segmentation
Venice-H1 improves failure-case mIoU by 0.89-1.40 points in referring image segmentation via multi-scale grid signatures and a failure-aware re-ranker, with positive CIs on all tested pairs and low harmful-switch rates.
-
Lighting-Consistent Object Transfer Across Radiance Fields
Diffusion-based per-view harmonization for lighting-consistent object transfer between 3DGS scenes, using heterogeneous training data and final 3D consolidation.
-
Radial Basis Function Networks as Projection Heads in Self-Supervised Learning
RBFN projection heads serve as competitive replacements for MLP heads in SSL and enable SNS, a label-free metric from RBF parameters that correlates strongly with logistic regression evaluation.
-
QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging
QG-MIL introduces four gated transformer components that yield +6.1 average macro F1 improvement over baselines on six whole-slide and cell-level medical imaging benchmarks while producing more uniform attention.
-
Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm
A distribution-alignment framework for unpaired cross-modal knowledge distillation with theoretical guarantees on feature and label alignment.
-
Next-Token Prediction Learns Generalisable Representations of Sleep Physiology
Next-token prediction on multi-modal tokenized sleep signals yields embeddings that match supervised performance with far less labels and generalize to daytime heart data.
-
Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure
CrackGeoFM is a multi-task framework that adapts a frozen visual foundation model with FCEM, CFAM, and SMTD modules for crack mask prediction, skeleton reconstruction, and uncertainty estimation, reporting SOTA results across 20 datasets including few-shot settings.
-
Do Transformers Need Three Projections? Systematic Study of QKV Variants
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
-
AdaCodec: A Predictive Visual Code for Video MLLMs
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
-
HERO'S JOURNEY: Testing Complex Rule Induction with Text Games
HERO'S JOURNEY benchmark evaluates LLMs on attribute and procedural rule induction across four structural forms, finding limited uneven performance with execution as the main bottleneck and steering helping only attribute tasks.
-
Infinite-Dimensional Spherical Kernel ridge Regression
An intrinsic spherical kernel ridge regression framework is introduced for non-linear responses on spheres, reducing infinite-dimensional estimation to finite via the representer theorem with convergence rates shown.
-
Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation
Proposes DSP, a disentangled semantics and primitives framework using Semantic Anchoring, Primitive Imbuing, and Conceptual Steering that improves 5-shot atypical L2I generation over prior methods.
-
TextTeacher: What Can Language Teach About Images?
TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.
-
Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls
Deep UCSL uses a contrastive EM loss on patient-control labels to isolate disease-driven subgroups in medical imaging by suppressing shared healthy variability.
-
SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining
SpectralEarth-FM is a multisensor hierarchical transformer pretrained on a 40TB co-located HSI-MSI-SAR dataset using a JEPA-style objective and reports state-of-the-art results on hyperspectral and standard EO benchmarks.
-
Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation
TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
-
Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation
Decomposed Vision-Language Alignment framework factorizes prompts into concept and attribute tokens with Feature-Gated Cross-Attention for better compositional generalization in fine-grained open-vocabulary segmentation.
-
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
-
PhysEditBench: A Protocol-Conditioned Benchmark for Dense Physical-Map Prediction with Image Editors
PhysEditBench is a protocol-conditioned benchmark evaluating image editors on dense prediction of depth, normal, albedo, roughness, and metallic maps from RGB images using curated data and fixed scoring rules.
-
On What We Can Learn from Low-Resolution Data
Low-resolution data improves high-resolution model performance when high-resolution samples are limited, via KL-divergence bounds and experiments on vision transformers and CNNs.
-
Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.