End-to-end object detection with transformers
Mixed citation behavior. Most common role is background (67%).
representative citing papers
Single-layer two-head Transformers learn sparse XOR with O(polylog(d)) parameters in one gradient step, breaking the Omega(d) parameter bottleneck of FFNNs.
Low-cost imprecise robots achieve 80-90% success on six fine bimanual manipulation tasks using imitation learning with a new Action Chunking with Transformers algorithm trained on only 10 minutes of demonstrations.
CondHead conditionally parameterizes detection heads on semantic embeddings via aggregated expert and dynamically generated streams to improve generalization for novel categories.
citing papers explorer
- Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature
MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.
- GLACIER: Rethinking Mass Spectrum Prediction as an Object Detection Problem
GLACIER is a single-stage transformer model treating MS/MS fragmentation as subgraph detection on molecular graphs, reporting 70.0% Top-1 accuracy on MassSpecGym and 8x speedup over prior two-stage methods.
- SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
- Flow Matching in Feature Space for Stochastic World Modeling
FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.
- Better Queries, Cheaper Attention: Adapting Transformers for Efficient Sparse Reconstruction
A geometry-aware dynamic-query transformer decoder with Local Strided Cross-Attention raises track reconstruction efficiency from 94.1% to 98.1%, halves latency, and cuts memory use by over 10x versus fixed-query baselines in a simplified HL-LHC simulation.
- AMAR: Lightweight Attention-Based Multi-User Activity Recognition from Wi-Fi CSI
AMAR uses a transformer with learnable query embeddings for set-based prediction of concurrent activities from composite Wi-Fi CSI, combined with edge feature extraction and vector quantization for bandwidth-efficient deployment.
- Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation
Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.
- Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
- LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and pseudo-fake samples.
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
- Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
- Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos
MMPM uses PIM for gaze/head/hand interactions and MTP (CVAE with query decoder) to model separate crossing/non-crossing trajectory distributions, outperforming baselines on PIE and JAAD with a new validation protocol.
- ReforMe: Re-Shaping Documents with Contextual Prompting and Layout-Aware Propagation
ReforMe is an interactive document digitization system using layout-aware propagation to generalize user corrections from natural language or direct edits, shown to improve efficiency in a 12-user study on real documents.
- Phast: Simultaneous reconstruction of photoelectron count and time profiles from PMT waveforms via machine learning
Phast applies a transformer encoder plus count-conditioned query decoder to reconstruct photoelectron count and time profiles from simulated PMT waveforms on toy Monte Carlo datasets.
- Predicting the thermodynamics in the chromosphere from the translation of SDO data into the IRIS$^{2}$ inversion results using a visual transformer model
A visual transformer model trained on IRIS inversions predicts chromospheric temperature and density from SDO data with correlations around 0.8 on 80% of test cases.
- RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery
RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
- Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention
Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.
- MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling
MapATM improves lane divider AP by 4.6 and mAP by 2.6 on NuScenes by treating actor trajectories as structural priors for road geometry.
- SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authentic samples.
- A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis
A framework for real-time ergonomic pose prediction from 3D volumetric video that trains personalized classifiers on user-labeled poses captured by RGB-D cameras.
- A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery
YOLO11n achieves the highest mAP@0.5:0.95 of 0.6065 for apple localization, with other detectors showing trade-offs in recall and precision at low confidence thresholds.
- Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications
The paper reviews limits in AI vision for robotics and describes work-in-progress on bridging sim-to-real domain gaps by linking real and synthetic training data.