In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
51 Pith papers cite this work. Polarity classification is still indexing.
roles: background 1
polarities: background 1
citing papers explorer
-
Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
SkyPart uses learnable prototypes for patch grouping, altitude modulation only in training, graph-attention readout, and Kendall-weighted loss to set new state-of-the-art single-pass performance on SUES-200, University-1652, and DenseUAV while widening gains under weather corruptions.
-
Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation
GRCA uses emitter-centric geometric culling of rays per triangle to accelerate LiDAR simulation in arbitrarily dynamic scenes, reporting up to 14.55x speedup over Embree and 7.97x over OptiX.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
-
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
-
Hyperbolic Concept Bottleneck Models
HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
-
Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
-
Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks
EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.
-
Congestion-Aware Dynamic Axonal Delay for Spiking Neural Networks
CADAD adds activity-dependent dynamic delays to SNNs, improving accuracy on speech datasets while cutting parameter count by about 50% versus prior static delay approaches.
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning
Trust-SSL introduces additive-residual trust weights in SSL to selectively handle corruptions in aerial imagery, yielding higher linear-probe accuracy and larger gains under severe degradations than SimCLR or VICReg.
-
Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data
DHCNet improves ultra-fine-grained visual categorization by progressively building holistic cognition from local discrepancies using self-shuffling and refinement on limited data.
-
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
-
A global dataset of continuous urban dashcam driving
CROWD is a new global dataset of 51,753 continuous urban dashcam segments spanning over 20,000 hours from 238 countries, with manual labels and automated object detections for routine driving analysis.
-
A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline
Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
MULTI uses two-stage textual inversion to disentangle camera lens, sensor, view, and domain factors for novel image generation, supporting dataset extension and ControlNet modifications on the new DF-RICO benchmark.
-
Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.
-
Learning Point Cloud Geometry as a Statistical Manifold: Theory and Practice
Point cloud geometry is cast as a statistical manifold of per-point Gaussians, with POLI learning the mapping through self-supervision to improve perception without labeled data.
-
MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition
MAG-VLAQ fuses multi-modal ground and aerial data via ODE-conditioned vector-of-locally-aggregated-queries to nearly double recall@1 on aerial-ground place recognition benchmarks.
-
Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal
Current AI image watermark removal attacks replace the watermark with a different forensic signal, allowing independent detectors to distinguish processed outputs from clean images at over 98% true-positive rate under a 1% false-positive budget.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
-
Generalized Category Discovery in Federated Graph Learning
GCD-FGL mitigates neighborhood absorption and global semantic inconsistency in federated generalized category discovery, delivering +4.86 average HRScore gain over baselines on five graph datasets.
-
QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization
QuIDE defines the Intelligence Index I = (C × P) / log₂(T+1) as a unified score for the compression-accuracy-latency trade-off in quantized neural networks, with experiments showing task-dependent optimal bit widths.
-
Model Merging: Foundations and Algorithms
New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
Where are they looking in the operating room?
Gaze-following models on extended 4D-OR and Team-OR datasets reach F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition while improving team communication detection by over 30%.
-
R³AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
-
Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR
SOLAR prevents latent rehearsal decay in online continual SSL by adaptively managing replay buffers with deviation proxies and an explicit overlap loss, delivering both fast convergence and state-of-the-art final accuracy on vision benchmarks.
-
Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments
A Learn-and-Dispose memory framework using static satellite anchors and diversity-driven dynamic buffers improves retention in continual aerial visual place recognition by 7.8% over random selection on a new 21-sequence benchmark.
-
Harnessing Weak Pair Uncertainty for Text-based Person Search
Uncertainty estimation and regularization on weak positive pairs improves mAP by 3.06%, 3.55%, and 6.94% on CUHK-PEDES, RSTPReid, and ICFG-PEDES respectively.
-
SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
-
Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection
SPIRE turns IRSTD into centroid regression via single-point supervision and a high-resolution probabilistic encoder, matching prior performance with lower compute and false alarms.
-
Toward Unified Fine-Grained Vehicle Classification and Automatic License Plate Recognition
UFPR-VeSV is a new real-world dataset for fine-grained vehicle classification and automatic license plate recognition collected from Brazilian police cameras, with benchmarks demonstrating its difficulty and the value of joint task use.
-
Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes
Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.
-
VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement
VFM-SDM enables accurate multi-directional structural displacement measurement from video using pre-trained vision models for camera estimation and point tracking, combined with geometry constraints, without task-specific training or preparation.
-
CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition
CEZSAR uses contrastive learning to align video and sentence embeddings with automatic negative sampling, claiming state-of-the-art zero-shot action recognition on UCF-101 and Kinetics-400.
-
Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram
GPT-4o achieves macro F1 scores of 0.89 for politician face recognition and 0.86 for person counting in election Instagram stories, outperforming FaceNet512, RetinaFace, and Google Cloud Vision.
-
From Skeletons to Pixels: Few-Shot Precise Event Spotting via Representation and Prediction Distillation
Multimodal distillation from skeletons to pixels improves few-shot precise event spotting on tennis and figure skating datasets.
-
Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
MixTGFormer reports state-of-the-art 3D pose estimation errors of 37.6 mm on Human3.6M and 15.7 mm on MPI-INF-3DHP by using parallel GCN-Transformer streams with SE layers for local-global feature fusion.
-
Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
Weak-to-strong knowledge distillation applied early and then turned off accelerates convergence to target performance in visual learning tasks by factors of 1.7-4.8x.
-
Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance
A tandem evaluation framework shows medical image reconstruction maintains diagnostic accuracy despite declining PSNR but can amplify sex-based biases modestly compared to existing model biases.
-
Protecting and Preserving Protest Dynamics for Responsible Analysis
A responsible computing framework substitutes real protest imagery with labeled synthetic reproductions from conditional image synthesis to enable privacy-aware analysis of collective action patterns.
-
Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning
A method combining pretrained YOLO11, YOLOE-26, and Gaze-LLE models detects student gaze targets in collaborative learning videos with F1-score 0.829 without requiring labeled training data.
-
Text Embeddings by Weakly-Supervised Contrastive Pre-training
E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.
-
Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation
Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation, with two principles proposed to ensure model explanations reflect the original model.
-
XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling
XiYOLO uses iterative energy-aware neural architecture search and scaling to produce object detectors with stronger accuracy-energy tradeoffs than YOLO baselines on GPUs and NPUs.
-
RoomRecon: High-Quality Textured Room Layout Reconstruction on Mobile Devices
RoomRecon delivers a real-time mobile system for high-quality textured 3D room reconstructions that combines AR-guided imaging with generative AI texturing focused on permanent structures and claims to outperform prior methods in quality and speed.
-
From Multimodal Signals to Adaptive XR Experiences for De-escalation Training
An early multimodal XR prototype fuses five signal streams with an interpretation layer to detect escalation cues and enable adaptive de-escalation training.
-
INTERACT: An AI-Driven Extended Reality Framework for Accesible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition
INTERACT is an AI-driven XR framework providing real-time sign language interpretation and emotion recognition for accessible virtual communication, with pilot tests showing 92% user satisfaction and high accuracy rates.
-
Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges
A survey that organizes methods for cross-domain object detection into a taxonomy, analyzes domain shift across detection stages, and outlines persistent challenges.