WildBox provides over 237k 3D wildlife annotations from drone video. Its benchmarks show zero-shot 3D detection at 0 AP, while fine-tuned models reach 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Canonical reference. 71% of citing Pith papers cite this work as background.
representative citing papers
Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
SPoILeR uses multimodal pre-training to enable accurate novel view synthesis of infrared, polarimetric, and multispectral data from RGB-supervised fine-tuning on new scenes.
MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.
PRISM-VO introduces photometric plenoptic bundle adjustment for drift-resilient, metric-scale visual odometry from a single focused plenoptic camera.
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
A regularization technique that treats diffusion model outputs as a similarity kernel during material optimization in inverse rendering, enabling joint reconstruction of geometry, materials, and illumination that satisfies the rendering equation and generalizes to new lighting.
HASTE enables training-free dynamic compression of pre-trained CNNs by patch-wise LSH-based merging of redundant channels, reporting 46.2% FLOPs reduction on ResNet34 CIFAR-10 with 1.25% accuracy drop.
AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
ScaLe-INR is a multi-branch INR architecture that applies directional scaling per the Fourier inverse theorem and a directional edge guidance loss to disentangle scales and improve reconstruction fidelity.
A technique for controllable diversity in text-to-image generation that induces structured semantic variations at the prompt level using a VLM and an agentic workflow.
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
DUNE enables exact data-consistency gradients via VJP when deep unrolled networks operate in representation space, yielding better MRI reconstructions than prior heuristic-DC variants.
REST-TS resolves text collapse in multimodal time series forecasting by exclusively supervising the text branch on numerical residuals to compel genuine content extraction from text descriptions.
HUG trains a flow-matching model on a new 1M-frame egocentric human grasp dataset to generate retargetable grasps from single RGB-D images, beating baselines by 23-34% on a new 90-object benchmark.
iSAGE achieves near-dense mIoU performance in remote sensing semantic segmentation using iterative expert clicks on confident model errors with an error-weighted loss, using only 0.011-0.04% of pixels.
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
Attributed Feature Graphs (AFGs) represent CAD features as attributed nodes and relations as directed edges to enable GNN surrogate models that predict design performance with feature-level interpretability on the CarHoods10K dataset.
LL-Bench supplies a human-annotated dataset exposing generative model weaknesses in low-level restoration and introduces LL-Score as an MLLM evaluator that outperforms existing quality metrics and can serve as a training reward.
DPA4 is a new SE(3)-equivariant interatomic potential with EMFA SO(2) convolution that sets new accuracy-cost records on Matbench Discovery and SPICE benchmarks using fewer parameters than prior models.
A new quality-guided approach for semi-supervised medical image segmentation that trains a predictor on synthetic errors to enhance pseudolabel handling.
DELOS applies contrastive learning to phase-folded light curves to detect shallow intermediate-to-long period transits, reporting 15.5% and 11.25% gains in combined precision-recall over BLS and TLS in low-SNR tests plus 3-80x speedups.
Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.
Morpheus learns morphable category-level shape priors to produce implicit 3D correspondences in camera space without explicit supervision and releases the HouseCorr3D benchmark with amodal and symmetry annotations.
citing papers explorer
- Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting
REST-TS resolves text collapse in multimodal time series forecasting by exclusively supervising the text branch on numerical residuals to compel genuine content extraction from text descriptions.
- MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion
MuteBench evaluates multimodal fusion robustness to modality missing and within-modality missing on 125000 samples from 9 clinical datasets, finding architecture family predicts tolerance better than parameter count.
- Hyperbolic Concept Bottleneck Models
HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
- Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks
EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.
- Congestion-Aware Dynamic Axonal Delay for Spiking Neural Networks
CADAD adds activity-dependent dynamic delays to SNNs, improving accuracy on speech datasets while cutting parameter count by about 50% versus prior static delay approaches.
- Self-Supervised Calibration of Scientific Instruments Using Physical Consistency Constraints
A physics-informed self-supervised framework learns detector calibration parameters and ionic charge-state predictions jointly from raw spectrometer data using iterative pseudo-labelling driven by physical constraints.
- Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps
SCALLOP replaces Hutchinson's trace estimator with a scalable, vectorized likelihood distillation objective for F2D2 flow maps, cutting training variance and time while improving performance on molecular Boltzmann generators and image data.
- Estimation–Prediction Tradeoff in Causal Probabilistic Temporal Graphs
Characterizes an estimation-prediction tradeoff in binary logistic models for causal probabilistic temporal graphs and proposes a framework to jointly evaluate temporal link prediction with causal parameter recovery via Cramér-Rao bounds.
- A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset
Gated Multi-modal Fusion reaches 0.82 macro F1 on HARMES, beating the concatenation baseline of 0.76 by 6 points under leave-one-participant-out evaluation.
- Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls
Deep UCSL uses a contrastive EM loss on patient-control labels to isolate disease-driven subgroups in medical imaging by suppressing shared healthy variability.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
- Generalized Category Discovery in Federated Graph Learning
GCD-FGL mitigates neighborhood absorption and global semantic inconsistency in federated generalized category discovery, delivering +4.86 average HRScore gain over baselines on five graph datasets.
- QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization
QuIDE defines the Intelligence Index I = (C × P) / log₂(T+1) as a unified score for the compression-accuracy-latency trade-off in quantized neural networks, with experiments showing task-dependent optimal bit widths.
- Model Merging: Foundations and Algorithms
New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.
- Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR
SOLAR prevents latent rehearsal decay in online continual SSL by adaptively managing replay buffers with deviation proxies and an explicit overlap loss, delivering both fast convergence and state-of-the-art final accuracy on vision benchmarks.
- Fusion or Confusion? Multimodal Complexity Is Not All You Need
Complex multimodal architectures do not reliably outperform unimodal baselines or a simple multimodal baseline under standardized evaluation.
- Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips
Flipping 1-2 sign bits in DNN parameters, located without data or optimization, drops accuracy to near zero across image classification, detection, segmentation, and language models.
- From Pixels to Temporal Correlations: Learning Informative Representations for Reinforcement Learning Pre-training
MTCL learns multi-scale temporal correlations in videos via contrastive learning to produce more informative representations that improve sample efficiency and performance in downstream RL tasks.
- Rethinking Federated Unlearning via the Lens of Memorization
Introduces Grouped Memorization Evaluation and FedMemPrune to remove unique memorized information in federated unlearning while preserving overlapping knowledge.
- Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes
Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.
- Explaining Graph Neural Networks for Node Similarity on Graphs
Empirical comparison shows gradient-based explanations for GNN node similarities are actionable, consistent, and retain effects when sparsified, unlike mutual information explanations.
- Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings
Framework for dataset subset selection via clustering, A/D-optimality, and FAFI with bootstrap intervals to preserve model rankings, showing high Spearman correlation (0.95 with 5 datasets) in TSC but limited gains in recommender systems.
- Explaining Machine Learning and Memorization with Statistical Mechanics
Thesis uses statistical mechanics to study DAM and RBM models for understanding memorization, low-dimensional learning, and adversarial robustness in neural networks.
- Attribution Graphs and Causal Probing for Mechanistic Discovery and Bias Repair in Multimodal Generative Learning