pith. machine review for the scientific record.

arxiv: 2604.21841 · v1 · submitted 2026-04-23 · 💻 cs.CR

Recognition: unknown

Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords autonomous vehicles · multi-sensor fusion · camera-LiDAR spoofing · cross-modal attacks · perception security · adversarial attacks · phantom objects

The pith

Coordinated camera and LiDAR spoofing creates consistent phantom objects that bypass the redundancy of multi-sensor fusion in autonomous vehicles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an attacker can deceive an autonomous vehicle's perception system by making the camera and LiDAR sensors report the same false object at the same location. This coordinated approach exploits the fusion process: designed to improve reliability by combining data from multiple sensors, it instead accepts cross-sensor agreement as proof of reality. If the finding holds, the built-in safety margin of sensor redundancy can be defeated by coordinated spoofing rather than requiring each modality to fail independently. The work simulates the sensor-level effects of an infrared projection (a perspective-aware image patch) and a matching LiDAR injection (a synthetic 3D point cluster) on real driving data, testing the idea at scale across 400 KITTI scenes. The attack succeeds against a state-of-the-art perception model in 85.5% of tested scenes.
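As a rough illustration of the mechanism (a minimal sketch, not the authors' implementation), both injected artifacts can be anchored to one shared 3D pose so that the two modalities agree by construction. Function names, object dimensions, point counts, and the example projection matrix below are illustrative assumptions.

```python
import numpy as np

def phantom_point_cluster(center, size=(1.8, 4.2, 1.6), n_points=200, rng=None):
    """Sample a car-sized synthetic LiDAR cluster around a 3D center (camera frame)."""
    rng = rng or np.random.default_rng(0)
    half = np.asarray(size) / 2.0
    return np.asarray(center) + rng.uniform(-half, half, size=(n_points, 3))

def project_points(points_3d, P):
    """Project Nx3 camera-frame points through a 3x4 projection matrix."""
    homo = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

# KITTI-style P2 with example intrinsics; real values come from calib files.
P = np.array([[721.5, 0.0, 609.6, 44.9],
              [0.0, 721.5, 172.9, 0.2],
              [0.0, 0.0, 1.0, 0.003]])

# One shared 3D pose drives both artifacts: the point cluster goes into the
# LiDAR sweep, and the image patch is warped into the bounding box of the
# projected cluster, so the two sensors "agree" on the false object.
cluster = phantom_point_cluster(center=(0.5, 1.2, 12.0))
uv = project_points(cluster, P)
patch_box = (uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max())
```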

Core claim

We design a coordinated, data-level (early-fusion) attack that emulates the outcome of two synchronized physical spoofing sources: an infrared projection that induces a false camera detection and a LiDAR signal injection that produces a matching 3D point cluster. Rather than implementing the physical attack hardware, we simulate its sensor-level outcomes by inserting perspective-aware image patches and synthetic LiDAR point clusters aligned in 3D space. Using 400 KITTI scenes, our large-scale evaluation shows that the coordinated spoofing deceives a state-of-the-art perception model with an 85.5% successful attack rate. These findings provide the first quantitative evidence that malicious cross-modal consistency can compromise MSF-based perception, revealing a critical vulnerability in the core data-fusion logic of modern autonomous vehicle systems.

What carries the argument

Perspective-aware image patches and 3D-aligned synthetic LiDAR point clusters that fabricate cross-sensor consistency for a false object while preserving the perceptual effects of real synchronized physical attacks.

If this is right

  • Multi-sensor fusion no longer provides robust protection once an attacker can force consistent but false readings across modalities.
  • The data-fusion logic itself becomes the attack surface when cross-modal agreement is fabricated rather than verified.
  • Sensor-level simulation of physical attacks is sufficient to expose the vulnerability without needing full hardware implementation.
  • AV perception pipelines that rely on early fusion are exposed to this class of consistency-based attacks at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future defenses could add independent consistency validation or cross-modal anomaly detection before trusting fused output (a toy version of such a check is sketched after this list).
  • The simulation-to-real gap flagged in the load-bearing premise below suggests targeted physical experiments as the next direct test.
  • Similar coordinated spoofing could be explored against other sensor pairs such as radar and camera in the same fusion framework.
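To make the first bullet concrete, a deliberately crude version of such a cross-modal check might test whether the LiDAR evidence inside a camera detection box has physically plausible structure before trusting the fused output. This is a hypothetical sketch, not a defense from the paper; the thresholds and the depth-spread heuristic are assumptions.

```python
import numpy as np

def consistency_score(points_uv, depths, box, min_points=30, min_depth_std=0.15):
    """Crude cross-modal check for one 2D detection box (u1, v1, u2, v2)."""
    u1, v1, u2, v2 = box
    inside = ((points_uv[:, 0] >= u1) & (points_uv[:, 0] <= u2) &
              (points_uv[:, 1] >= v1) & (points_uv[:, 1] <= v2))
    if inside.sum() < min_points:
        return 0.0  # too little LiDAR support for the camera detection
    # Heuristic assumption: injected clusters may be unnaturally compact in
    # depth, while a real object's surface shows structured depth variation.
    depth_std = float(np.std(depths[inside]))
    return 1.0 if depth_std >= min_depth_std else 0.5
```

Whether such a test actually separates injected clusters from real returns is exactly the kind of question the targeted physical experiments in the second bullet would have to answer.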

Load-bearing premise

The simulated image patches and point clusters produce the same sensor outputs and fusion behavior as actual physical infrared projections and electromagnetic signal injections would on real hardware.

What would settle it

Run synchronized physical attacks using an infrared projector and a LiDAR signal injector on a real vehicle carrying the same perception model, then measure whether the observed success rate for phantom object detection matches the 85.5% reported in simulation.
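For reference, one plausible operationalization of "successful attack" (the referee notes below that the paper's summary does not pin this down; the IoU threshold here is an assumption):

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (u1, v1, u2, v2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def attack_success_rate(detections_per_scene, phantom_boxes, iou_thr=0.5):
    """Fraction of scenes where any detection overlaps the injected phantom box."""
    hits = sum(
        any(iou_2d(det, gt) >= iou_thr for det in dets)
        for dets, gt in zip(detections_per_scene, phantom_boxes)
    )
    return hits / len(phantom_boxes)
```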

Figures

Figures reproduced from arXiv: 2604.21841 by Raiful Hasan, Shahriar Rahman Khan.

Figure 1. Proposed coordinated, cross-modal attack scenario. A single attacker …
Figure 2. Validation of the LiDAR Spoofing Attack (Condition 2). …
read the original abstract

Autonomous Vehicles (AVs) increasingly depend on Multi-Sensor Fusion (MSF) to combine complementary modalities such as cameras and LiDAR for robust perception. While this redundancy is intended to safeguard against single-sensor failures, the fusion process itself introduces a subtle and underexplored vulnerability. In this work, we investigate whether an attacker can bypass MSF's redundancy by fabricating cross-sensor consistency, making multiple sensors agree on the same false object. We design a coordinated, data-level (early-fusion) attack that emulates the outcome of two synchronized physical spoofing sources: an infrared (IR) projection that induces a false camera detection and a LiDAR signal injection that produces a matching 3D point cluster. Rather than implementing the physical attack hardware, we simulate its sensor-level outcomes by inserting perspective-aware image patches and synthetic LiDAR point clusters aligned in 3D space. This approach preserves the perceptual effects that real IR and IEMI-based spoofing would create at the sensor output. Using 400 KITTI scenes, our large-scale evaluation shows that the coordinated spoofing deceives a state-of-the-art perception model with an 85.5% successful attack rate. These findings provide the first quantitative evidence that malicious cross-modal consistency can compromise MSF-based perception, revealing a critical vulnerability in the core data-fusion logic of modern autonomous vehicle systems.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a coordinated data-level attack on multi-sensor fusion (MSF) perception in autonomous vehicles. It emulates synchronized physical spoofing—an IR projection creating a false camera detection and IEMI-based LiDAR point injection creating a matching 3D cluster—by inserting perspective-aware image patches and aligned synthetic LiDAR points. Large-scale evaluation on 400 KITTI scenes reports an 85.5% attack success rate against a state-of-the-art perception model, positioning this as the first quantitative demonstration that malicious cross-modal consistency can defeat MSF redundancy.

Significance. If the simulated sensor artifacts faithfully reproduce real physical attacks, the result would be significant: it shows that MSF's intended robustness can be inverted by enforcing cross-sensor agreement on false objects rather than attacking modalities independently. The scale (400 public scenes) and explicit simulation approach are strengths that enable reproducibility and falsifiable follow-up work.

major comments (2)
  1. [Abstract] Abstract and evaluation description: the 85.5% success rate is presented without specifying the exact MSF architecture, the precise definition of 'successful attack' (e.g., IoU threshold, detection score, or 3D consistency metric), the ranges of attack parameters, or any single-sensor baseline comparisons. These omissions make it impossible to determine whether the result actually demonstrates bypassing of redundancy or simply reflects a weak fusion implementation.
  2. [Abstract] Abstract (simulation paragraph): the central claim that the attack 'deceives a state-of-the-art perception model' and reveals a 'critical vulnerability in the core data-fusion logic' rests on the untested assumption that perspective-aware patches plus synthetic LiDAR clusters produce sensor outputs equivalent to real synchronized IR projection and IEMI spoofing. No hardware validation, noise-characteristic comparison, or ablation on sensor-specific responses is provided; any mismatch in beam patterns, timing, or response functions would invalidate transfer to the claimed real-world MSF compromise.
minor comments (2)
  1. Add a dedicated subsection or table enumerating the exact MSF model, fusion weights, and detection thresholds used in the 400-scene experiments.
  2. Clarify how 3D alignment between inserted image patches and LiDAR clusters is maintained across varying KITTI camera-LiDAR extrinsics and scene depths (the projection chain in question is sketched below).
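For context on minor comment 2, the alignment in question runs through the standard KITTI calibration chain: velodyne points reach pixel coordinates via Tr_velo_to_cam, R0_rect, and P2. Matrix names follow the KITTI devkit; the homogeneous-padding helper is our own sketch.

```python
import numpy as np

def velo_to_image(pts_velo, Tr_velo_to_cam, R0_rect, P2):
    """Project Nx3 velodyne points to pixel coordinates (KITTI convention)."""
    def to_4x4(m):
        out = np.eye(4)
        out[:m.shape[0], :m.shape[1]] = m
        return out
    T = to_4x4(Tr_velo_to_cam)              # 3x4 extrinsics -> homogeneous
    R = to_4x4(R0_rect)                     # 3x3 rectification -> homogeneous
    homo = np.hstack([pts_velo, np.ones((len(pts_velo), 1))])
    cam = R @ T @ homo.T                    # rectified camera frame, 4xN
    uvw = P2 @ cam                          # 3xN image-plane coordinates
    depth = uvw[2]
    uv = (uvw[:2] / depth).T                # Nx2 pixel coordinates
    return uv[depth > 0]                    # drop points behind the camera
```

Keeping an inserted image patch and point cluster aligned then reduces to projecting the cluster through each scene's own calibration rather than assuming fixed extrinsics.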

Simulated Author's Rebuttal

2 responses · 1 unresolved

We appreciate the referee's thorough review and constructive feedback. Below we provide point-by-point responses to the major comments. We have revised the manuscript to address the concerns regarding evaluation details and simulation assumptions.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation description: the 85.5% success rate is presented without specifying the exact MSF architecture, the precise definition of 'successful attack' (e.g., IoU threshold, detection score, or 3D consistency metric), the ranges of attack parameters, or any single-sensor baseline comparisons. These omissions make it impossible to determine whether the result actually demonstrates bypassing of redundancy or simply reflects a weak fusion implementation.

    Authors: We thank the referee for highlighting this. The full paper specifies the MSF architecture in the evaluation section as a particular state-of-the-art model, defines a successful attack as the model detecting the false object above IoU and confidence-score thresholds, provides the parameter ranges used for attack generation, and includes single-sensor baselines. To make this immediately clear from the abstract, we have revised it to mention these elements concisely. This clarifies that the high success rate stems from the coordinated consistency rather than a weak fusion system, as the baselines show lower rates for single modalities. revision: yes

  2. Referee: [Abstract] Abstract (simulation paragraph): the central claim that the attack 'deceives a state-of-the-art perception model' and reveals a 'critical vulnerability in the core data-fusion logic' rests on the untested assumption that perspective-aware patches plus synthetic LiDAR clusters produce sensor outputs equivalent to real synchronized IR projection and IEMI spoofing. No hardware validation, noise-characteristic comparison, or ablation on sensor-specific responses is provided; any mismatch in beam patterns, timing, or response functions would invalidate transfer to the claimed real-world MSF compromise.

    Authors: We agree that hardware validation would provide stronger evidence for real-world transferability. Our work focuses on a simulation of the sensor outputs to enable large-scale quantitative evaluation, which is a common approach in security research for AV perception attacks. In the revised manuscript, we have expanded the simulation description to include more details on how the patches and points are generated to match typical sensor characteristics, added comparisons to noise models from related physical attack papers, and included an ablation study on sensor response variations. We also explicitly discuss the limitations of the simulation approach and the assumptions made. This addresses the concern while maintaining the paper's scope as a data-level study. revision: partial
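One way the mentioned ablation could look (a hypothetical sketch; the noise parameters are illustrative, not taken from the paper): perturb the injected cluster with simple sensor-response noise and re-measure the success rate.

```python
import numpy as np

def perturb_cluster(pts, pos_sigma=0.03, dropout=0.1, rng=None):
    """Jitter point positions (ranging noise, meters) and randomly drop returns."""
    rng = rng or np.random.default_rng(0)
    noisy = pts + rng.normal(0.0, pos_sigma, pts.shape)
    keep = rng.random(len(noisy)) > dropout
    return noisy[keep]
```

Sweeping pos_sigma and dropout while re-measuring the success rate would show how sensitive the 85.5% figure is to sensor-response detail.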

standing simulated objections not resolved
  • Full hardware validation of the coordinated IR and IEMI spoofing, which was not performed as the study is simulation-based.

Circularity Check

0 steps flagged

No circularity: empirical attack simulation on public data

full rationale

The paper presents a purely empirical evaluation: it simulates coordinated spoofing by inserting perspective-aware image patches and aligned synthetic LiDAR clusters into 400 KITTI scenes, then measures an 85.5% attack success rate against a state-of-the-art perception model. No equations, derivations, fitted parameters, or predictions are claimed. No self-citations are used to justify uniqueness or load-bearing premises. The central result is a direct, falsifiable measurement on an external public dataset; the simulation method is described explicitly and does not assume the conclusion it is meant to test. This is a standard empirical demonstration with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified equivalence between the authors' synthetic sensor outputs and real physical spoofing hardware effects, plus the assumption that KITTI scenes and the chosen perception model are representative of deployed AV systems.

axioms (2)
  • domain assumption Simulated perspective-aware patches and aligned 3D point clusters produce the same perceptual effect on the fusion model as real IR projection and LiDAR signal injection would.
    Stated in the abstract as the justification for using simulation instead of physical hardware.
  • domain assumption The state-of-the-art perception model and KITTI dataset distribution are representative of real-world MSF behavior in autonomous vehicles.
    Implicit in the choice of evaluation data and model for claiming relevance to AV systems.

pith-pipeline@v0.9.0 · 5548 in / 1373 out tokens · 32750 ms · 2026-05-09T21:13:14.789950+00:00 · methodology

