A linearized solver estimates rolling-shutter relative pose and motion from 7 affine correspondences in 1.2 ms and reports best-in-benchmark accuracy plus usable translational velocity.
2016.280
Canonical reference. 82% of citing Pith papers cite this work as background.
citing papers explorer
- Rolling Shutter Relative Pose Estimation Made Practical
A linearized solver estimates rolling-shutter relative pose and motion from 7 affine correspondences in 1.2 ms and reports best-in-benchmark accuracy plus usable translational velocity.
- 4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
- WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory
WHU-Infra3D is a new large-scale multi-modal dataset and benchmark for 3D roadside infrastructure inventory, providing over 175k 2D boxes, thousands of 3D instances, and 181k annotations across five core tasks while exposing cross-city gaps and long-tailed defect vulnerabilities.
- A Systematic Benchmark of Intraoperative Ultrasound-to-MR Synthesis for Brain Tumour Surgery
On the public ReMIND dataset, a systematic benchmark of six synthesis models across 48 experiments finds LPIPS correlates with downstream segmentation utility while SSIM does not, with SynDiff-2.5D performing best.
- Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
- Projection-Free Transformers via Gaussian Kernel Attention
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
- Differentially Private Contrastive Learning via Bounding Group-level Contribution
DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1% higher retrieval accuracy than existing approaches.
- AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
- Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data
DHCNet improves ultra-fine-grained visual categorization by progressively building holistic cognition from local discrepancies using self-shuffling and refinement on limited data.
- Navig-AI-tion: Navigation by Contextual AI and Spatial Audio
A system combining VLM landmark instructions with real-time corrective spatial audio reduces route deviations in a small user study compared to VLM-only and Google Maps audio baselines.
- MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
MobileMold provides 4941 smartphone microscopy images and shows deep learning models reach 99.5% accuracy on mold detection and food classification tasks.
- Accelerating Inference for Multilayer Neural Networks with Quantum Computers
Quantum circuits for coherent multilayer neural network inference achieve quadratic to polylogarithmic speedups over classical methods depending on quantum data access models for inputs and weights.
- MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models
MultiMat shows multimodal large models plus constrained search produce higher-quality procedural material graphs than text-only baselines on a new production dataset.
- Flowing With Purpose: Latent Action Guided Flow Matching Policies For Robotic Manipulation
LAFM adapts the source distribution in flow matching policies via a latent action model to better match fragmented robotic action spaces, claiming 23.4% higher real-world success and 10.4% on LIBERO-90 while beating larger pre-trained models.
- Radial Basis Function Networks as Projection Heads in Self-Supervised Learning
RBFN projection heads serve as competitive replacements for MLP heads in SSL and enable SNS, a label-free metric from RBF parameters that correlates strongly with logistic regression evaluation.
- FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection
FATE combines pillar encoding via orthogonal polynomial basis with frequency-aware training to enable event-based object detection at up to 200 Hz without internal temporal sub-binning.
- Jaguar: Fast Private CNN Inference with Power-of-Two Homomorphic Arithmetic
Jaguar replaces prime-modulus HE with power-of-two arithmetic to enable coefficient-domain convolution and local-shift truncation, reporting 2-3.7x lower latency than Cheetah and Rhombus on ResNet-18/50 and MobileNetV2.
- Model Merging: Foundations and Algorithms
New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.
- Neighbor2Inverse: Self-Supervised Denoising for Low-Dose Region-of-Interest Phase Contrast CT
Neighbor2Inverse adapts the Neighbor2Neighbor principle to train a denoising network directly in the image domain for low-dose PBI-CT by using independently noised subsampled projections.
- Remote SAMsing: From Segment Anything to Segment Everything
Remote SAMsing pipeline boosts SAM2 coverage on remote sensing scenes from 30-68% to 91-98% via multi-pass masking and boundary-aware merging while preserving mask quality.
- Threat-Oriented Digital Twinning for Security Evaluation of Autonomous Platforms
A threat-oriented digital twinning methodology and an open-source modular twin are introduced for security evaluation of autonomous platforms, translating threat analysis into controllable tests for spoofing, replay, and adversarial ML attacks.
- Where are they looking in the operating room?
Gaze-following models on extended 4D-OR and Team-OR datasets reach F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition while improving team communication detection by over 30%.
- Geometric Correction of Side-Scan Sonar Images with Image-Consistent Attitude Refinement
A geometric correction technique for side-scan sonar images that refines yaw-pitch attitude by fusing navigation baselines with image-inferred perturbations separated via port-starboard symmetry.
- Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection
An automated Python simulator, calibrated to one experimental run, generates consistent time-series data for many batch distillation scenarios including anomalies, forming an openly released hybrid dataset for deep anomaly detection.
- Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
- Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
- Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information
Holi-DETR improves fashion item detection by integrating co-occurrence probabilities, inter-item spatial arrangements, and body keypoint relationships into the DETR architecture.
- Graph Signal Denoising Using Regularization by Denoising and Its Parameter Estimation
RED is adapted to graph signals with deep unrolling for parameter estimation, yielding lower MSE than prior graph denoising methods on synthetic and real data.
- LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
LaV-CoT introduces a multi-stage visual CoT pipeline and GRPO training with language-consistency rewards, delivering up to 9.5% accuracy gains on multilingual VQA benchmarks over similar-sized open models.
- Near OOD Detection for Vision-Language Prompt Learning with Contrastive Logit Score
Contrastive Logit Score (CLS) improves near OOD detection AUROC by up to 11.67% for pre-trained vision-language prompt learning methods as a plug-and-play post-hoc function.
- SignNet-1M: Large-Scale Multilingual Sign Language Video Dataset with Downstream Benchmarks
The paper releases SignNet-1M, a 1M-scale augmented dataset for ASL, CSL and DGS with 3DGS and diffusion-based variations, plus benchmarks showing improved cross-shift generalization.
- Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging
Neuro-JEPA is a sparse multimodal foundation model pretrained on 1,551,862 brain MRI scans that shows stronger and more consistent performance than existing models and CNN baselines across 47 tasks from clinical and public datasets.
- Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration
MAP-Elites with CPPNs, DSP graphs, and a deep classifier produces diverse synthetic sounds across durations and musical/non-musical contexts.
- Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation
Introduces a structured framework showing that visual predicate failures under degradation are non-uniform, with static predicates more robust than dynamic ones like grasp and release, and quantifies downstream accuracy drops.
- Efficient 3D Content Reconstruction and Generation
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
- RACANet: Reliability-Aware Crowd Anchor Network for RGB-T Crowd Counting
RACANet proposes a reliability-aware two-stage fusion network with cross-modal pretraining and local anchor modules that outperforms prior RGB-T crowd counting methods on standard benchmarks.
- Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks
Introduces Explicit Logic Channel (ELC) with LLM, VFM and probabilistic inference for validating, selecting and enhancing MLLMs on zero-shot tasks using Consistency Rate and cross-channel integration.
- ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient
ProBA replaces rigid point tracks with a probabilistic pose graph and 3D Gaussian landmarks, optimizing via negative log-likelihood with the Bhattacharyya coefficient to expand the basin of attraction in prior-free SfM.
- Neuron ranking -- an informed way to condense convolutional neural networks architecture
Shapley value and variational importance switch methods produce consistent rankings of filter importance in CNNs, enabling compression and interpretability.
- SEADA: An efficient methodology for optimizing mixed-precision DNNs on multi-precision spatial architectures
SEADA introduces an analytical framework combining cost models, mapping tools, and entropy-based precision selection to optimize mixed-precision DNNs on multi-precision spatial architectures.
- A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting
A learned linear activation bridge achieves high alignment (cosine ~0.97) between Pythia-160M and Pythia-410M states but produces no improvement in downstream multi-hop answering when injected into the receiver.
- Improving acoustic drone detection generalization through pretraining and data augmentation
Pretraining on broad sound events plus on-the-fly augmentations improves out-of-domain true-positive rates for acoustic drone detection at fixed low false-positive rates.
- INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference
INAR-VL routes 36% of visual question answering requests to the edge using lightweight complexity signals, cutting latency 24% and energy 26% while retaining 97% of cloud accuracy.
- Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images
Particle Diffusion Matching uses diffusion-guided random walk searches to align challenging standard and ultra-widefield retinal images, claiming state-of-the-art benchmark performance.
- Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation
MeCSAFNet reports mIoU gains of 4.8-19.6% over U-Net and SegFormer baselines on FBP and Potsdam datasets by processing spectral channels separately and fusing features with CBAM attention.
- TwinLiteNet+: An Enhanced Multi-Task Segmentation Model for Autonomous Driving
TwinLiteNet+ is a hybrid-encoder multi-task segmentation model with new UCB, USB, and PCAA modules that reports 92.9% mIoU on drivable area and 34.2% IoU on lane segmentation on BDD100K while using 11x fewer FLOPs than prior models.
- Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation
HSAN integrates hierarchical semantic graphs, optimal transport-based goal selection, and graph-aware RL to claim SOTA results on VLN-CE tasks.
- Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.
- PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging
- FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting