WildBox provides over 237k 3D wildlife annotations from drone video; its benchmarks show zero-shot 3D detection at 0 AP, rising to 8.68 AP-BEV and 13.17 AP3D after fine-tuning, with depth estimation causing most errors.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Canonical reference. 71% of citing Pith papers cite this work as background.
representative citing papers
Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
SPoILeR uses multimodal pre-training to enable accurate novel view synthesis of infrared, polarimetric, and multispectral data from RGB-supervised fine-tuning on new scenes.
MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.
PRISM-VO introduces photometric plenoptic bundle adjustment for drift-resilient, metric-scale visual odometry from a single focused plenoptic camera.
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
A regularization technique that treats diffusion model outputs as a similarity kernel during material optimization in inverse rendering, enabling joint reconstruction of geometry, materials, and illumination that satisfies the rendering equation and generalizes to new lighting.
HASTE enables training-free dynamic compression of pre-trained CNNs by patch-wise LSH-based merging of redundant channels, reporting 46.2% FLOPs reduction on ResNet34 CIFAR-10 with 1.25% accuracy drop.
AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
ScaLe-INR is a multi-branch INR architecture that applies directional scaling per the Fourier inverse theorem and a directional edge guidance loss to disentangle scales and improve reconstruction fidelity.
A technique for controllable diversity in text-to-image generation by inducing structured semantic variations at the prompt level via VLM and agentic workflow.
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
DUNE enables exact data-consistency gradients via VJP when deep unrolled networks operate in representation space, yielding better MRI reconstructions than prior heuristic-DC variants.
REST-TS resolves text collapse in multimodal time series forecasting by exclusively supervising the text branch on numerical residuals to compel genuine content extraction from text descriptions.
HUG trains a flow-matching model on a new 1M-frame egocentric human grasp dataset to generate retargetable grasps from single RGB-D images, beating baselines by 23-34% on a new 90-object benchmark.
iSAGE achieves near-dense mIoU performance in remote sensing semantic segmentation using iterative expert clicks on confident model errors with an error-weighted loss, using only 0.011-0.04% of pixels.
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
Attributed Feature Graphs (AFGs) represent CAD features as attributed nodes and relations as directed edges to enable GNN surrogate models that predict design performance with feature-level interpretability on the CarHoods10K dataset.
LL-Bench supplies a human-annotated dataset exposing generative model weaknesses in low-level restoration and introduces LL-Score as an MLLM evaluator that outperforms existing quality metrics and can serve as a training reward.
DPA4 is a new SE(3)-equivariant interatomic potential with EMFA SO(2) convolution that sets new accuracy-cost records on Matbench Discovery and SPICE benchmarks using fewer parameters than prior models.
A new quality-guided approach for semi-supervised medical image segmentation that trains a predictor on synthetic errors to enhance pseudolabel handling.
DELOS applies contrastive learning to phase-folded light curves to detect shallow intermediate-to-long period transits, reporting 15.5% and 11.25% gains in combined precision-recall over BLS and TLS in low-SNR tests plus 3-80x speedups.
Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.
Morpheus learns morphable category-level shape priors to produce implicit 3D correspondences in camera space without explicit supervision and releases the HouseCorr3D benchmark with amodal and symmetry annotations.
citing papers explorer
-
GenSP: Consistent Spherical Parameterization via Learning Shape Generative Models
GenSP learns a continuous neural deformation model from sphere coordinates and latent codes to produce consistent spherical parameterizations for genus-0 shapes.
-
FLORA: A deep learning approach to predict forest attributes from heterogeneous LiDAR data
FLORA is an octree-based deep learning framework with auxiliary data fusion that predicts forest attributes from heterogeneous LiDAR, achieving rRMSE of 12.3% for dominant height and 39% for total volume on 32k French NFI plots.
-
PGUDA: Pressure-Guided Unsupervised Domain Adaptation with Cross-Modal Knowledge Distillation for sEMG-Based Gesture Recognition
PGUDA uses pressure signals to train a teacher network that distills modality-invariant knowledge into an sEMG student via cross-modal distillation, reaching 58.08% cross-subject accuracy with only 5% labeled data for the teacher.
-
Fleet: Few Shots Lead Effective AI-generated Image Detection
Fleet achieves dynamic few-shot adaptation for AIGI detection via avoidance routing in decoupled subspaces, raising accuracy from 20.4% to 73.1% on new generators like Doubao Seedream 4.0 with 10 shots.
-
RBE-Flow: Recurrent Bayesian Estimation on Feature Manifolds for Cross-Modal Registration
RBE-Flow recasts dense cross-modal flow estimation as closed-loop recurrent Bayesian estimation on learned feature manifolds with uncertainty-adaptive updates and achieves SOTA on three registration benchmarks.
-
Self-Supervised Calibration of Scientific Instruments Using Physical Consistency Constraints
A physics-informed self-supervised framework learns detector calibration parameters and ionic charge-state predictions jointly from raw spectrometer data using iterative pseudo-labelling driven by physical constraints.
-
Confidence-feedback-weighted graph matching network: online-offline laser-induced damage site matching under complex interference
A confidence-feedback-weighted graph matching network achieves 96.36% F1-score on damage site matching by using matchability confidence to weight edge features and applying geometric consistency and hard-example mining.
-
Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps
SCALLOP replaces Hutchinson's trace estimator with a scalable, vectorized likelihood distillation objective for F2D2 flow maps, cutting training variance and time while improving performance on molecular Boltzmann generators and image data.
-
Estimation–Prediction Tradeoff in Causal Probabilistic Temporal Graphs
Characterizes an estimation-prediction tradeoff in binary logistic models for causal probabilistic temporal graphs and proposes a framework to jointly evaluate temporal link prediction with causal parameter recovery via Cramér-Rao bounds.
-
A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset
Gated Multi-modal Fusion reaches 0.82 macro F1 on HARMES, beating the concatenation baseline of 0.76 by 6 points under leave-one-participant-out evaluation.
-
Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets
Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.
-
Interpretable Uncertainty Routing Separating Emotion Ambiguity from Distribution Shift in Facial Expression Recognition
Uncertainty decomposition via deep ensembles separates annotator disagreement from distribution shift in FER, enabling a routing mechanism that retains 1.8x more ambiguous faces at matched OOD rejection compared to single-uncertainty baselines.
-
Venice-H1: Failure-Aware Query Re-Ranking with Multi-Scale Grid Signatures for Referring Image Segmentation
Venice-H1 improves failure-case mIoU by 0.89-1.40 points in referring image segmentation via multi-scale grid signatures and a failure-aware re-ranker, with positive CIs on all tested pairs and low harmful-switch rates.
-
Cross-View Yaw Estimation in Location Uncertainty with Line-Aligning Yaw Scoring
Introduces LAYS, a radially invariant line-consensus voting method achieving sub-degree yaw precision in cross-view localization independent of location accuracy.
-
TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living
TimeProVe is a propose-then-verify framework that uses lightweight action-based candidate evidence generation followed by targeted VLM verification for efficient long video temporal reasoning, achieving a 7.3% improvement on OTB with 75% fewer VLM calls.
-
Temporal Preference Optimization for Unsupervised Retrieval
TPOUR uses a novel TRPO method to improve unsupervised retrievers for temporal relevance, outperforming baselines including a much larger model on nDCG@5 for explicit and implicit time queries.
-
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.
-
SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining
SceneMiner shows that identity-preserving multi-task fine-tuning removes cross-task interference by zero-initializing new heads and freezing shared-stream parameters, enabling unified BEV scene mining with preserved original heads.
-
Hybrid Robustness Verification for Spatio-Temporal Neural Networks
STBP computes exact closed-form bounds for the first convolutional layer of spatio-temporal networks and propagates scalable approximations through the rest to certify robustness under subset-frame or patch perturbations.
-
SOMA: From Surface Observations to Muscle Anatomy
SOMA recovers spatio-temporal muscle behavior from multi-view RGB surface data, described as the first method to do so from RGB observations, and introduces the SKIM soft-tissue deformation dataset.
-
VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning
VeriDrive introduces a verifiable counterfactual supervision framework using a Perception-Evaluation-Revision chain and validator-guided correction to generate cost-efficient structured data for vision-language driving models, showing metric gains on nuScenes.
-
DREAM: Dynamic Refinement of Early Assignment Mappings
DREAM proposes intent-aware tokenization, frozen-model evaluation, and dynamic beams to refine early SID assignments and improve cold-start performance in generative recommenders on Amazon benchmarks.
-
COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
COMAP co-evolves textual world models and agent policies for LLMs through on-policy self-distillation, yielding up to 16.75% relative gains on embodied planning, web navigation, and tool-use tasks.
-
Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes
Introduces Photographic Scene Graph and aesthetic-guided comparative planning to generate physically feasible and human-preferred portrait plans in 3D scenes.
-
Deep Psychovisual Image Representations
Proposes a psychovisual-inspired deep learning method that encodes images in learned frequency sub-bands for interpretable semantic structures and reduced depth dependence.
-
MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images
MuNet is an end-to-end graph convolutional network using 2-manifold graphs and a mutualistic training mechanism that jointly optimizes 3D human mesh recovery and clothed reconstruction, reporting state-of-the-art results on six benchmarks.
-
ARCANE-PedSynth: Synthetic Multi-Pedestrian Datasets with Behavioural Crossing Annotations
ARCANE-PedSynth is a CARLA-based framework that generates synthetic multi-pedestrian datasets with behavioral crossing annotations by using hybrid AI-manual control to raise crossing rates and a 12-state FSM for diverse behaviors.
-
MR-LiDAR: A Multi-Resolution Roadside LiDAR Benchmark for Perception Diagnostics and Deployment Guidance
MR-LiDAR benchmark shows an 80-beam LiDAR with optimized distribution can match or exceed 128-beam uniform LiDAR for roadside vehicle and VRU detection.
-
HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models
HyperGuide projects LLM hidden states into hyperbolic space to create a distance-to-origin signal for solution proximity and uses it to guide multi-step generation via a trained head and low-rank adapter.
-
Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls
Deep UCSL uses a contrastive EM loss on patient-control labels to isolate disease-driven subgroups in medical imaging by suppressing shared healthy variability.
-
LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators
LiFT factorizes 3D medical volume synthesis into per-slice 2D generation and inter-slice trajectory learning, using a tri-planar drifting loss for unconditional coherence and a z-context mixer for paired translation tasks.
-
3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds
3DTMDet proposes a hybrid Mamba-Transformer architecture with a 3DHMT block and LiDAR-inspired voxel generation to improve 3D object detection in point clouds, outperforming prior methods on KITTI and ONCE datasets.
-
A General Differentiable Ray-Wave Framework for Hybrid Refractive-Diffractive System Modeling and Optimization
A plug-and-play differentiable model bridging ray and wave optics for hybrid systems that enables end-to-end optimization of planar and conformal diffractive elements.
-
Deep Pre-Alignment for VLMs
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
-
MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
MULTI uses two-stage textual inversion to disentangle camera lens, sensor, view, and domain factors for novel image generation, supporting dataset extension and ControlNet modifications on the new DF-RICO benchmark.
-
Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.
-
Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
SkyPart achieves state-of-the-art single-pass cross-view geo-localization on SUES-200, University-1652, and DenseUAV by using prototype-based part discovery, altitude-conditioned modulation, and Kendall-weighted loss, with widening gains under weather corruptions.
-
Learning Point Cloud Geometry as a Statistical Manifold: Theory and Practice
Point cloud geometry is cast as a statistical manifold of per-point Gaussians, with POLI learning the mapping in a self-supervised manner to improve perception without labeled data.
-
MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition
MAG-VLAQ fuses multi-modal ground and aerial data via ODE-conditioned vector-of-locally-aggregated-queries to nearly double recall@1 on aerial-ground place recognition benchmarks.
-
Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal
Current AI image watermark removal attacks replace the watermark with a different forensic signal, allowing independent detectors to distinguish processed outputs from clean images at over 98% true-positive rate under a 1% false-positive budget.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
-
Generalized Category Discovery in Federated Graph Learning
GCD-FGL mitigates neighborhood absorption and global semantic inconsistency in federated generalized category discovery, delivering +4.86 average HRScore gain over baselines on five graph datasets.
-
QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization
QuIDE defines the Intelligence Index I = (C × P) / log₂(T+1) as a unified score for the compression-accuracy-latency trade-off in quantized neural networks, with experiments showing task-dependent optimal bit widths.
-
SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification
SHIELD is a new diverse clinical note dataset paired with distilled small language models that achieve 0.89 span-level precision and 0.88 recall for on-premise PHI de-identification.
-
Model Merging: Foundations and Algorithms
New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
Where are they looking in the operating room?
Gaze-following models on extended 4D-OR and Team-OR datasets reach F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition while improving team communication detection by over 30%.
-
R³AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
-
Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR
SOLAR prevents latent rehearsal decay in online continual SSL by adaptively managing replay buffers with deviation proxies and an explicit overlap loss, delivering both fast convergence and state-of-the-art final accuracy on vision benchmarks.
-
Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments
A Learn-and-Dispose memory framework using static satellite anchors and diversity-driven dynamic buffers improves retention in continual aerial visual place recognition by 7.8% over random selection on a new 21-sequence benchmark.