{"total":59,"items":[{"citing_arxiv_id":"2606.31844","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging Local Observation and Global Simulation in Closed-Loop Traffic Modeling","primary_cat":"cs.RO","submitted_at":"2026-06-30T15:45:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CRAFT reduces collisions by 31.2% and traffic violations by 33.2% in closed-loop traffic simulation by discovering context-induced failures in what-if rollouts and using a contextual preference evaluator to reweight autoregressive decoding toward globally coherent behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31814","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Lane Topology Reasoning via Autoregressive Model with Geometry Prior","primary_cat":"cs.CV","submitted_at":"2026-06-30T15:27:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TopoGPT pre-trains an autoregressive transformer on serialized lane graphs from 3.3M scenes to learn geometry priors and uses a perception adapter to apply it to BEV features for improved lane graph prediction on OpenLane-V2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29716","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World","primary_cat":"cs.CV","submitted_at":"2026-06-29T02:48:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to 
real-world UAV aerial views.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27317","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OctoSense: Self-Supervised Learning for Multimodal Robot Perception","primary_cat":"cs.CV","submitted_at":"2026-06-25T17:30:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OctoSense supplies a large multimodal robotics dataset and a late-fusion masked autoencoder that runs fast and outperforms image-only models on optical flow, depth, segmentation, and ego-motion tasks while remaining robust under sensor degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26424","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Training & Inference for Forecasting: Linking Winner-Take-All back to GMMs","primary_cat":"cs.LG","submitted_at":"2026-06-24T22:26:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Links WTA training mismatch in GMM-modeled forecasters to uninformative posteriors and introduces post-hoc merging plus one-step EM to yield better-ranked mode probabilities without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22617","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-21T17:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniSpace is a plug-and-play method that improves spatial reasoning in MLLMs for AV by injecting camera pose, 
using epipolar attention across views, and distilling 3D geometric knowledge to overcome weak cross-view correspondence and depth estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21344","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind the Noise: Sensitivity of Transformer-based Interaction-Aware Trajectory Prediction Models to Noisy Data","primary_cat":"cs.AI","submitted_at":"2026-06-19T11:41:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Noise in input state data degrades Transformer trajectory prediction accuracy by factors of 1.3x to 3.9x under realistic conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20725","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"D2HDMap: Non-visible Driveline Map Prior for Online Vectorized HD Map Prediction","primary_cat":"cs.CV","submitted_at":"2026-06-17T00:05:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"D2HDMap uses a non-visible driveline prior to guide online vectorized HD map prediction, reaching 44.8 mAP on geographically disjoint splits of nuScenes and Argoverse 2 while retaining performance without the prior at inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19370","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Human-like autonomy emerges from self-play and a pinch of human data","primary_cat":"cs.LG","submitted_at":"2026-06-11T19:16:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Self-play RL regularized with 30 minutes 
of human data produces driving policies that coordinate with humans, training in 15 hours on one GPU with 2500x less data than imitation learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17080","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HRDX: A Large-Scale Vector HD-Map Dataset","primary_cat":"cs.RO","submitted_at":"2026-06-11T15:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HRDX is a 1400 km vector HD-map dataset with multi-sensor capture, aerial orthoimagery, 10 classes and 20+ attributes, plus benchmarks showing scale and aerial data improve map construction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11874","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AutoMine Solution for AV2 2026 Scenario Mining Challenge","primary_cat":"cs.AI","submitted_at":"2026-06-10T09:58:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AutoMine applies LLMs and VLMs with self-refining code generation to scenario mining and reports 36.38 HOTA-Temporal and 77.21 Timestamp BA on the AV2 2026 challenge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11739","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-View In-Cabin Monitoring System for Public Transport Vehicles","primary_cat":"cs.CV","submitted_at":"2026-06-10T07:16:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a 9136-sample multi-view in-cabin dataset from a German city bus with RGB, depth, LiDAR, 3D annotations via 
pseudo-labeling, nuScenes conversion, and benchmarks on models like BEVFusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11120","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football","primary_cat":"cs.AI","submitted_at":"2026-06-09T17:16:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCPS adapts a trajectory generator from autonomous driving to simulate counterfactual 3D pass outcomes in football and produces distribution-aware execution-surplus scores from value model rollouts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10641","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CAMASA: A CAM-based Dataset from the MASA Living Lab","primary_cat":"cs.NI","submitted_at":"2026-06-09T09:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CAMASA is a real-world V2X dataset with 40M+ CAMs and 2M+ DENMs from Modena for C-ITS trajectory prediction and simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09882","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory","primary_cat":"cs.CV","submitted_at":"2026-06-03T06:14:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WHU-Infra3D is a new large-scale multi-modal dataset and benchmark for 3D roadside infrastructure inventory, providing over 175k 2D boxes, 
thousands of 3D instances, and 181k annotations across five core tasks while exposing cross-city gaps and long-tailed defect vulnerabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02379","ref_index":38,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Honey, I Shrunk the Arc de Triomphe!","primary_cat":"cs.CV","submitted_at":"2026-06-01T15:28:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31572","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:40:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30561","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLM3: Vision Language Models Are Native 3D Learners","primary_cat":"cs.CV","submitted_at":"2026-05-28T20:48:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Standard VLMs achieve expert-level 3D performance on depth 
estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28552","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-27T14:44:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SMamba-DDPG trains separate policies on Argoverse 2 safety-critical interactions to reproduce pedestrian avoidance, finding faster reactions, lower speeds, and fewer conflicts with AVs than HDVs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24037","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-21T11:37:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mode-as-Sequence turns unordered multimodal trajectory sets into ordered sequences with explicit mode dependencies via recurrent or parallel decoding plus EMTA loss, yielding top rankings on Waymo motion prediction challenges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20390","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STELLAR: Scaling 3D Perception Large Models for Autonomous 
Driving","primary_cat":"cs.CV","submitted_at":"2026-05-19T18:40:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19038","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic","primary_cat":"cs.RO","submitted_at":"2026-05-18T19:00:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STRELGen combines a multi-agent diffusion model with differentiable STREL specifications to optimize latent space for generating plausible yet safety-critical driving scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18074","ref_index":75,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-18T08:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segmentation, flow prediction, and motion forecasting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17229","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generating Realistic 
Safety-Critical Scenarios for Vehicle-Pedestrian Interactions","primary_cat":"cs.RO","submitted_at":"2026-05-17T02:30:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A three-stage framework pre-trains multi-agent RL agents on real safety-critical data, refines them via online learning in CARLA, and generates the VPSCI dataset of over 198,000 realistic vehicle-pedestrian interaction episodes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15876","ref_index":59,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unlocking Dense Metric Depth Estimation in VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-15T11:54:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10026","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-11T05:50:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MUSDA proposes hierarchical domain classifiers for multi-modality feature alignment and a prototype graph strategy for multi-source prediction fusion in unsupervised domain adaptation for 3D object 
detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09619","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GSMap: 2D Gaussians for Online HD Mapping","primary_cat":"cs.CV","submitted_at":"2026-05-10T15:57:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GSMap represents HD map elements as sequences of 2D Gaussians to unify geometric precision and topological regularity for online autonomous driving maps.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"This design enables GSMap to generate topologically consistent and geometrically precise HD maps in a single forward pass, while maintaining comparable inference efficiency to existing vectorization-based models. 4 Experiment 4.1 Experiment Setup Datasets. We evaluate our method on two large-scale autonomous driving datasets: nuScenes [1] and Argoverse2 [32]. The nuScenes dataset comprises 1000 driving scenes, each providing multi-view RGB images from six surround cameras. Following standard practice [1], we split the data into 700 scenes for training and 150 for validation. Argoverse2 contains 1,000 scenes with seven-camera RGB imagery. 
We adopt the same split protocol as prior work [32], using 700 scenes for training and 150 scenes for validation."},{"citing_arxiv_id":"2605.09425","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation","primary_cat":"cs.CV","submitted_at":"2026-05-10T08:56:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003. [86] Maolin Wei, Wanzhou Liu, and Eshed Ohn-Bar. Driveqa: Passing the driving knowledge test. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. [87] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023. 
[88] Magnus Wrenninge and Jonas Unger."},{"citing_arxiv_id":"2605.08911","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-09T12:12:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniTopo unifies lane detection and topology reasoning into a single perception model, outperforming prior methods on OpenLane-V2 benchmarks with TOP_ll scores of 30.1% and 31.8%.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"each connection query to capture topology-specific cues (e.g., connectivity), thereby enhancing the supervision for predicting topological relationships. IV. EXPERIMENTS A. Datasets and Metrics To evaluate our proposed method, we conduct experiments on the topology reasoning benchmark OpenLane-V2 [6]. OpenLane-V2 [6] comprises two subsets, subset A and subset B, which are annotated based on the Argoverse2 [2] and nuScenes [1] datasets, respectively. Each subset contains 1000 scenes with annotations for lane centerlines, traffic elements, lane-to-lane topology, and lane-to-traffic element topology. Subset A provides seven camera views per frame, whereas subset B offers six camera views per frame. 
The evaluation metrics consist of four components, all based"},{"citing_arxiv_id":"2605.05014","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography","primary_cat":"cs.CV","submitted_at":"2026-05-06T15:13:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CARD is a new multi-modal driving dataset delivering ~500K dense depth pixels per frame from challenging road topographies using stereo cameras and fused LiDARs over 110 km.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02762","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unified Map Prior Encoder for Mapping and Planning","primary_cat":"cs.CV","submitted_at":"2026-05-04T16:01:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UMPE fuses any subset of HD/SD vector maps, raster SD maps, and satellite imagery into BEV features via alignment-aware vector and raster branches, raising mapping mAP by 5.3-5.9 points and cutting planning L2 error by 0.30 m on nuScenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01478","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LIE: LiDAR-only HD Map Construction with Intensity Enhancement via Online Knowledge Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-02T14:52:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIE delivers LiDAR-only HD map segmentation via online knowledge distillation that fuses intensity maps, beating the best camera-only model by 8.2% mIoU on nuScenes while 
adapting quickly to new datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00907","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation","primary_cat":"cs.CV","submitted_at":"2026-04-29T04:29:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"US Highway 101 Dataset, FHWA-HRT-07-030. Available at: https://www.fhwa.dot.gov/publications/research/operations/07030/index.cfm. [16] Barmpounakis, E., and N. Geroliminis. On the New Era of Urban Traffic Monitoring with Massive Drone Data: The pNEUMA Large-Scale Field Experiment. Transportation Research Part C: Emerging Technologies, 2020, 111: 50-71. doi:10.1016/j.trc.2019.11.023. [17] Wilson, B., W. Qi, T. Agarwal, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv, 2023. doi:10.48550/arXiv.2301.00493. [18] Li, Y., R. Yu, C. Shahabi, et al. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv, 2018. doi:10.48550/arXiv.1707.01926. 
[19] California Department of Transportation."},{"citing_arxiv_id":"2604.24119","ref_index":29,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations","primary_cat":"cs.CV","submitted_at":"2026-04-27T07:13:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TopoHR introduces hierarchical point/instance/semantic queries and a unified P2I+I2I topology module that reports SOTA gains on OpenLane-V2 subsets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23934","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions","primary_cat":"eess.SY","submitted_at":"2026-04-27T01:19:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLM-VPI uses Qwen3-VL and GPT-OSS models for pedestrian intent and age reasoning plus a tiered safety controller, reporting 92.3% intent accuracy in CARLA and reduced conflicts versus rule-based and supervised baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22851","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-04-22T07:49:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input 
adds negligible signal, with explicit trajectories restoring consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"referential, requiring the grounding of the agent's own motion state within the temporal visual stream. EgoDyn-Bench addresses this gap by providing a structured diagnostic to evaluate whether a model's high-level semantic interpretation of its own movement is accurately anchored in physical concepts. Trajectory Forecasting and Control. Standard benchmarks like Argoverse [40], ScenePilot-Bench [39], and EgoTraj-Bench [20] evaluate motion via displacement-based metrics. However, spatial accuracy does not guarantee kinematic feasibility or compliance with underlying physical concepts. Instead of assessing motion generation, EgoDyn-Bench provides an isolated diagnostic of the model's intrinsic high-level physical understanding, evaluating whether its"},{"citing_arxiv_id":"2604.18486","ref_index":106,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Boris Ivanovic, and Marco Pavone. Alpamayo-R1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025. [105] Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. Sim-cot: Supervised implicit chain-of-thought. arXiv preprint arXiv:2509.20317, 2025. 
[106] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023. [107] Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and"},{"citing_arxiv_id":"2604.18476","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:28:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17024","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras","primary_cat":"cs.CV","submitted_at":"2026-04-18T15:14:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16783","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EdgeVTP: Exploration of Latency-efficient Trajectory Prediction for Edge-based Embedded Vision 
Applications","primary_cat":"cs.CV","submitted_at":"2026-04-18T02:13:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EdgeVTP delivers the lowest measured end-to-end latency on Jetson-class platforms while matching or exceeding state-of-the-art accuracy on highway trajectory benchmarks by using bounded graph interactions and a one-shot curve decoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12857","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic","primary_cat":"cs.AI","submitted_at":"2026-04-14T15:09:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"trajectory generation, driver modeling, and scenario generation). We monitored key venues from both the machine learning and transportation communities. Citation tracking and snowballing was also used from foundational and recent papers to identify additional relevant works. We also tracked major benchmarks and challenges, including the Waymo Open Motion Dataset [23], Argoverse [24], nuPlan [25], and the Waymo Open Sim Agents Challenge (WOSAC) [26]. For cognitive and physics-informed methods, we also explored the human factors and cognitive science literature, including journals focusing on Human Factors and Cognitive Science. 
For each methodological category, we selected 7-15 representative papers based on five criteria: (1) a foundational"},{"citing_arxiv_id":"2604.11400","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing","primary_cat":"cs.RO","submitted_at":"2026-04-13T12:42:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EagleVision creates a standardized multi-task benchmark for LiDAR perception in high-speed autonomous racing, with experiments showing that pretraining on racing data improves cross-domain detection and prediction performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08509","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visually-grounded Humanoid Agents","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:50:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"We summarize the roles of experiment scenes in world-layer and agent-layer evaluation together with key scene properties. Scene World Eval. Agent Eval. Universality Eval. Recon. Quality Inter. Range Sem. Richness Coll. 
Mesh SmallCity [34] ✓ ✓ ✓ High Massive High ✓ XGRIDS [104] ✓ ✓ Ultra Massive High ✓ SAGE-3D [56] ✓ ✓ High Room-scale Medium ✓ ArgoVerse2 [101], PandaSet [106] ✓ Medium Large Vehicle-only ✓ Mip-NeRF360 [5], DL3DV-10K [44] ✓ Medium Limited Low/Medium ✓ MatrixCity [39], Horizon-GS [26] ✓ Low Massive High ✓ SuperSplat [66], Pointcosm [85] ✓ Ultra Large Medium ✗ Thus, we cast high-level planning as a selection problem over discrete human-centric action primitives [61], avoiding the inherent limitations of VLMs [76] in continuous control"},{"citing_arxiv_id":"2604.08626","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WildDet3D: Scaling Promptable 3D Detection in the Wild","primary_cat":"cs.CV","submitted_at":"2026-04-09T16:00:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"For example, when asked to \"locate the most expensive object in this scene,\" the VST Qwen3-VL WildDet3D-Agent Find the closest person in this scene. Which food has the highest calories? Which player just hit the ball? What is the most expensive object? Figure 10 WildDet3D-agent: referring expression localization. Results of 3D box outputs by WildDet3D compared to VST [57] and Qwen3-VL [56]. WildDet3D-Agent more reliably localizes the queried object. 20 VLM correctly reasons and recognizes that the computer is probably most expensive and grounds it with a 2D box, after which WildDet3D produces the corresponding 3D cuboid. 
When VLM models such as VST or Qwen3-VL are asked to directly produce the 3D box, they both provided (rather inaccurate) boxes for"},{"citing_arxiv_id":"2604.06332","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-07T18:13:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05908","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-04-07T14:11:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ADM-GS decomposes static background appearance into traversal-invariant material and traversal-dependent illumination via a frequency-separated neural light field, yielding +0.98 dB PSNR gains and better cross-traversal consistency on Argoverse 2 and Waymo data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04887","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:36:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with 
improved generalization via a new paired dataset, language-guided masks, and joint training losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04737","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LEAN-3D: Low-latency Hierarchical Point Cloud Codec for Mobile 3D Streaming","primary_cat":"eess.SP","submitted_at":"2026-04-06T15:04:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LEAN-3D delivers 3-5x lower latency and up to 5.1x lower edge energy for learned point cloud compression on mobile hardware by restricting learned components to shallow hierarchy levels and using deterministic coding deeper in the tree.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02903","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-03T09:20:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RayMamba improves long-range 3D object detection by ray-aligned serialization of sparse voxels for state space modeling, delivering up to 2.49 mAP gain on nuScenes in the 40-50 m range.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"for safe motion planning. However, long-range perception remains a major challenge for LiDAR-based detectors. On the nuScenes dataset [1], which uses a 32-beam LiDAR sensor, objects beyond roughly 40 meters are typically represented by fewer than ten returns due to distance-induced sparsity and foreground occlusion, as illustrated in Fig. 1. 
Argoverse 2 [2], which uses dual 32-beam LiDARs, exhibits a similar high-sparsity issue beyond roughly 50 meters. This severe degradation makes accurate long-range 3D detection substantially more difficult than near-range and mid-range perception. This difficulty also poses a challenge to existing detector architectures. Earlier methods, such as sparse convolution-"},{"citing_arxiv_id":"2604.01044","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A global dataset of continuous urban dashcam driving","primary_cat":"cs.CV","submitted_at":"2026-04-01T15:52:17+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CROWD is a new global dataset of 51,753 continuous urban dashcam segments spanning over 20,000 hours from 238 countries, with manual labels and automated object detections for routine driving analysis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"01044v2 [cs.CV] 7 Apr 2026 Curated benchmarks for driving perception, spanning general street scene driving datasets such as KITTI [8], Cityscapes [9], ApolloScape [10], BDD100K [11], Mapillary Vistas [12], KITTI 360 [13], and A2D2 [14], as well as autonomous vehicle focussed multi sensor data sets such as Argoverse [15], Argoverse 2 [16], nuScenes [17], the Waymo Open Dataset [18], and PandaSet [19], have accelerated methodological advances by providing high quality sensor data and annotations. These benchmarks are fundamental for detection, segmentation, tracking, and forecasting, but they also reflect practical constraints of collection. Geographic coverage is often limited to a small number of cities or regions, many releases prioritise short"}],"limit":50,"offset":0}