WildBox provides over 237k 3D wildlife annotations from drone video. Benchmarks show zero-shot 3D detection at 0 AP, while fine-tuned models reach 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.
Derf: Decomposed radiance fields
Mixed citation behavior. Most common role is background (68%).
representative citing papers
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
TabPFN is a Prior-Data Fitted Network that approximates Bayesian inference for small tabular classification by training a Transformer once on synthetic data drawn from a causal prior, then solves new tasks in a single forward pass without further updates.
An explicit model using learned 3D Gaussians for volume compression encodes geometry explicitly and outperforms implicit neural representations on unstructured volumes with faster training.
SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.
RESOLVE provides a controlled multi-resolution LiDAR and camera benchmark for evaluating 3D detection and tracking under point sparsity variations in roadside cooperative perception.
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
Introduces the VG-GUIBench benchmark and the TASKER keyframe extraction algorithm, which improves performance on VideoQA and video-guided agentic tasks.
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.
SpikeTAD proposes the first end-to-end temporal action detection (TAD) model based on spiking neural networks (SNNs), reporting 67.2% mAP on THUMOS14 and 37.42% on ActivityNet-1.3 with extremely low power consumption.
An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.
Attributed Feature Graphs (AFGs) represent CAD features as attributed nodes and relations as directed edges to enable GNN surrogate models that predict design performance with feature-level interpretability on the CarHoods10K dataset.
Reveal-IG performs path attribution by integrating model output changes along trajectories in a space of probe distributions rather than input-space paths, retaining completeness and handling multiscale or uncertain features.
A quality-guided approach for semi-supervised medical image segmentation trains a predictor on synthetic errors to improve pseudo-label handling.
Morpheus learns morphable category-level shape priors to produce implicit 3D correspondences in camera space without explicit supervision and releases the HouseCorr3D benchmark with amodal and symmetry annotations.
ClothTransformer is a unified latent-space Transformer for cloth simulation that handles body-driven garments, robotic manipulation, and free-fall collisions in one model with 4-9x lower error than prior methods and mesh-resolution independence.
Introduces a unified benchmark for continual anomaly detection with discrete and continuous protocols plus a training-free DINOSaur method that outperforms prior CAD approaches with zero forgetting and sub-100ms edge inference.
iTryOn is a diffusion-based framework that adds spatial 3D hand guidance and semantic action-aware embeddings to handle complex garment deformations during human-clothing interactions in videos.
DARE-EEG is a self-supervised EEG foundation model that enforces mask-invariance via contrastive mask alignment and momentum anchor alignment, plus conv-linear-probing for heterogeneous setups, achieving SOTA accuracy and cross-dataset portability.
A rule-based strikingness measure is added to TKGR metrics to weight rare events higher, revealing that models weaken on striking events and ensemble gains come mostly from trivial fits.
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
Introduces the ViTextCaps dataset and the PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, and shows that cross-modal graph edges harm performance.