pith. machine review for the scientific record. sign in

arxiv: 2604.11671 · v2 · submitted 2026-04-13 · 📡 eess.SP · cs.RO

Recognition: unknown

VLMaterial: Vision-Language Model-Based Camera-Radar Fusion for Physics-Grounded Material Identification

He Chen, Jiangyou Zhu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:54 UTC · model grok-4.3

classification 📡 eess.SP cs.RO
keywords material identificationvision-language modelscamera-radar fusiondielectric constantmillimeter-wave radartraining-freephysics-groundedadaptive fusion
0
0 comments X

The pith

VLMaterial fuses vision-language models with radar dielectric constants to identify materials at 96.08% accuracy without training

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VLMaterial, a training-free framework that fuses vision-language models with millimeter-wave radar to recognize materials accurately even when they appear identical, such as glass versus plastic. An optical pipeline generates material candidate proposals using the segment anything model and VLM, while an electromagnetic pipeline extracts intrinsic dielectric constants from radar signals through the effective peak reflection cell area method and weighted vector synthesis. Context-augmented generation supplies radar-specific physical knowledge to the VLM, and an adaptive fusion step resolves conflicts via uncertainty estimation. This setup delivers 96.08% accuracy across more than 120 real-world experiments with 41 everyday objects and 4 visually deceptive counterfeits, matching closed-set benchmarks yet requiring no task-specific data collection or retraining. A sympathetic reader would care because it enables safer physical-world interaction for intelligent systems without the usual data or category restrictions.

Core claim

VLMaterial is a training-free framework for physics-grounded material identification that fuses vision-language models with radar-derived electromagnetic parameters. The system proposes material candidates via an optical pipeline using the segment anything model and VLM, extracts intrinsic dielectric constants using the effective peak reflection cell area method and weighted vector synthesis, equips the VLM with context-augmented generation for interpreting radar knowledge, and resolves cross-modal conflicts via adaptive fusion based on uncertainty estimation. In over 120 experiments involving 41 diverse everyday objects and 4 typical visually deceptive counterfeits across varying real-world

What carries the argument

Dual-pipeline architecture with effective peak reflection cell area (PRCA) method and weighted vector synthesis to extract dielectric constants from radar, combined with context-augmented generation (CAG) to integrate radar knowledge into the VLM and adaptive fusion based on uncertainty estimation.

If this is right

  • Material recognition extends to novel or open-set objects without collecting new labeled data or retraining for each category.
  • Intelligent systems gain the ability to distinguish physically different but visually similar materials for safer real-world interactions.
  • Radar provides interpretable physical parameters that enhance the semantic capabilities of vision-language models.
  • Performance holds across changing lighting and environments due to the physics-grounded fusion.
  • The framework matches specialized closed-set accuracy while remaining generalizable beyond fixed material lists.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding similar physical parameters from other sensors could improve VLM reliability in broader multi-modal perception tasks.
  • The PRCA extraction technique might generalize to characterize additional material properties when combined with different radar frequencies.
  • Robotic systems could adopt this fusion for manipulation tasks where material misidentification poses direct safety risks.
  • Testing the uncertainty estimation on conflicting data from additional modalities like thermal imaging could further reduce ambiguity.

Load-bearing premise

The effective peak reflection cell area method and weighted vector synthesis accurately extract stable dielectric constants from radar that serve as reliable references for the VLM, and uncertainty estimation correctly resolves cross-modal conflicts.

What would settle it

A controlled test set of visually deceptive objects where radar-derived dielectric constants deviate due to environmental interference, causing the fused output to misidentify materials at rates well above 10 percent.

Figures

Figures reproduced from arXiv: 2604.11671 by He Chen, Jiangyou Zhu.

Figure 1
Figure 1. Figure 1: All typical daily objects: 41 items across 6 materials and 7 categories (the paper kettle was excluded due to its rarity). [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Validating inference stability via deterministic output verification. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Experimental setups for data collection in different scenarios and detailed equipment configuration. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overall object material recognition (VLM-only). [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Evaluation of VLMs on four radar data representations (Raw data, Point cloud, Heatmap, and ERF). [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Overview of VLMaterial [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Electromagnetic characterization pipeline. [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: (Left) Beam pattern of the antenna array; (Right) [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The impact of object positioning and surface curvature. [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Weighted vector summation. calibration phase ϕcali,j : Cj = exp (−j(ϕmeas,j − ϕcali,j )). (5) We denote Ij (v) ∈ C as the focused signal component for the j-th antenna. To ensure precise phase alignment, we apply the calibration phasor Cj to rectify hardware offsets and compensate for the geometric phase delay: Ij (v) = xj · Cj · exp  j 4π∥pj − v∥ λ  , (6) where xj is the raw received signal. The cohere… view at source ↗
Figure 12
Figure 12. Figure 12: CAG-Enhanced VLMaterial. rp. According to the relationship between rp and the dielectric constant εr (see Appendix B), we finally obtain: rp = εr cos θ − p εr − sin2 θ εr cos θ + p εr − sin2 θ . (15) Next, we define the variables as follows: A = εr cos θ, B = q εr − sin2 θ. (16) The final relationship between εr, rp and θ simplifies to: εr = (rp + 1)2 " 1 ± r 1 −  sin 2θ(rp−1) rp+1 2 # 2 cos2 θ(rp − 1)2… view at source ↗
Figure 13
Figure 13. Figure 13: Overall object material recognition (LLMaterial). [PITH_FULL_IMAGE:figures/full_fig_p010_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Overall object material recognition (Radar-only (VLMaterial)). [PITH_FULL_IMAGE:figures/full_fig_p010_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Overall object material recognition (VLMaterial). [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Material recognition accuracy. TABLE III: Performance comparison of existing methods. Method LLMaterial [18] VLM-only Radar-only (VLMaterial) VLMaterial mSense [21] CRFUSION [27] Training-free ✓ ✗ Recognition accuracy 19.69% 78.74% 41.73% 96.08% 93.00% 96.00% Ultimately, VLMaterial achieves an accuracy of 96.08%. Baseline Comparisons. Recognition accuracies for seven material types are summarized in [PIT… view at source ↗
Figure 17
Figure 17. Figure 17: Plastic boards (different sizes, same thickness). [PITH_FULL_IMAGE:figures/full_fig_p012_17.png] view at source ↗
Figure 19
Figure 19. Figure 19: Scene (50 cm, bright light) [PITH_FULL_IMAGE:figures/full_fig_p012_19.png] view at source ↗
Figure 22
Figure 22. Figure 22: Scene (25 cm, dim light) [PITH_FULL_IMAGE:figures/full_fig_p013_22.png] view at source ↗
Figure 25
Figure 25. Figure 25: VLM-only (different camera) [PITH_FULL_IMAGE:figures/full_fig_p013_25.png] view at source ↗
Figure 28
Figure 28. Figure 28: Counterfeit objects. (a) Aluminum-plastic panel (ceramic-like finish). (b) Ceramic board (wood-like appearance). [PITH_FULL_IMAGE:figures/full_fig_p013_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Comparison of estimated dielectric constants with [PITH_FULL_IMAGE:figures/full_fig_p017_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Comparison of material recognition results for the [PITH_FULL_IMAGE:figures/full_fig_p017_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Overall object material recognition via prompt-guided camera-radar fusion. [PITH_FULL_IMAGE:figures/full_fig_p018_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Comparison of material recognition results for four [PITH_FULL_IMAGE:figures/full_fig_p018_32.png] view at source ↗
read the original abstract

Accurate material recognition is a fundamental capability for intelligent perception systems to interact safely and effectively with the physical world. For instance, distinguishing visually similar objects like glass and plastic cups is critical for safety but challenging for vision-based methods due to specular reflections, transparency, and visual deception. While millimeter-wave (mmWave) radar offers robust material sensing regardless of lighting, existing camera-radar fusion methods are limited to closed-set categories and lack semantic interpretability. In this paper, we introduce VLMaterial, a training-free framework that fuses vision-language models (VLMs) with domain-specific radar knowledge for physics-grounded material identification. First, we propose a dual-pipeline architecture: an optical pipeline uses the segment anything model and VLM for material candidate proposals, while an electromagnetic characterization pipeline extracts the intrinsic dielectric constant from radar signals via an effective peak reflection cell area (PRCA) method and weighted vector synthesis. Second, we employ a context-augmented generation (CAG) strategy to equip the VLM with radar-specific physical knowledge, enabling it to interpret electromagnetic parameters as stable references. Third, an adaptive fusion mechanism is introduced to intelligently integrate outputs from both sensors by resolving cross-modal conflicts based on uncertainty estimation. We evaluated VLMaterial in over 120 real-world experiments involving 41 diverse everyday objects and 4 typical visually deceptive counterfeits across varying environments. Experimental results demonstrate that VLMaterial achieves a recognition accuracy of 96.08%, delivering performance on par with state-of-the-art closed-set benchmarks while eliminating the need for extensive task-specific data collection and training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. VLMaterial introduces a training-free camera-radar fusion framework for material identification that combines vision-language models (VLMs) with mmWave radar physics. The dual-pipeline architecture uses SAM and VLM for optical candidate proposals, extracts intrinsic dielectric constants via the effective peak reflection cell area (PRCA) method and weighted vector synthesis in the electromagnetic pipeline, applies context-augmented generation (CAG) to inject radar knowledge into the VLM, and employs adaptive fusion with uncertainty estimation to resolve cross-modal conflicts. The paper reports 96.08% recognition accuracy across 120 real-world experiments involving 41 everyday objects and 4 visually deceptive counterfeits, claiming parity with closed-set SOTA benchmarks without task-specific training.

Significance. If the PRCA-based extraction and uncertainty-driven fusion prove robust, the work offers a meaningful advance in open-set, interpretable multimodal sensing by leveraging pre-trained VLMs and domain physics to reduce data collection burdens. The emphasis on real-world testing with deceptive counterfeits and the training-free design are practical strengths that could benefit robotics, autonomous navigation, and safety applications where visual ambiguity (e.g., glass vs. plastic) must be resolved reliably.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of 96.08% accuracy is presented without quantitative baselines, error bars, ablation studies on PRCA extraction or fusion components, or statistical validation of the uncertainty estimation. This absence prevents assessment of whether the result is on par with SOTA closed-set methods or if the training-free pipeline incurs hidden performance costs.
  2. [§3.2] §3.2 (Electromagnetic Characterization Pipeline): The PRCA method and weighted vector synthesis are asserted to extract stable intrinsic dielectric constants used as references for the VLM and fusion. However, no analysis or correction is provided for modulation by object curvature, surface roughness, incidence angle, or multipath effects, which directly risks systematic bias in the extracted values and undermines the physics-grounded claim.
minor comments (2)
  1. [§3.3] The description of the adaptive fusion mechanism in §3.3 lacks explicit equations or pseudocode for uncertainty computation and conflict resolution, reducing reproducibility.
  2. [§4] Figure captions and the experimental setup description could clarify the distribution of the 120 trials across object types, environments, and counterfeits to support the reported aggregate accuracy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results and the physical foundations of the PRCA method. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of 96.08% accuracy is presented without quantitative baselines, error bars, ablation studies on PRCA extraction or fusion components, or statistical validation of the uncertainty estimation. This absence prevents assessment of whether the result is on par with SOTA closed-set methods or if the training-free pipeline incurs hidden performance costs.

    Authors: The abstract summarizes the aggregate accuracy across 120 experiments, while §4 reports direct comparisons to closed-set SOTA benchmarks that achieve comparable accuracy. We agree, however, that the section would be strengthened by explicit error bars (standard deviation across trials), component-wise ablations for PRCA and CAG, and statistical tests on the uncertainty estimates. We will revise §4 accordingly to include these elements, enabling clearer evaluation of the training-free pipeline. revision: yes

  2. Referee: [§3.2] §3.2 (Electromagnetic Characterization Pipeline): The PRCA method and weighted vector synthesis are asserted to extract stable intrinsic dielectric constants used as references for the VLM and fusion. However, no analysis or correction is provided for modulation by object curvature, surface roughness, incidence angle, or multipath effects, which directly risks systematic bias in the extracted values and undermines the physics-grounded claim.

    Authors: The PRCA formulation isolates the dominant peak reflection cell to approximate the intrinsic dielectric constant, which is intended to reduce sensitivity to distributed surface effects. We acknowledge that the current manuscript does not provide explicit analysis or correction terms for curvature, roughness, incidence angle, or multipath. We will revise §3.2 to add a dedicated discussion of these factors, their expected influence on the extracted values, and how the downstream CAG and adaptive fusion stages mitigate residual bias, supported by additional sensitivity checks in the experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity in VLMaterial's empirical framework

full rationale

The paper presents a training-free camera-radar fusion system that combines pre-trained VLMs (via SAM and CAG) with a proposed electromagnetic pipeline using the PRCA method and weighted vector synthesis to extract dielectric constants. The central performance claim of 96.08% accuracy is obtained directly from 120+ real-world experiments on 41 objects and counterfeits, without any derivation, prediction, or first-principles result that reduces by construction to fitted parameters, self-defined quantities, or self-citations. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results are present; the radar extraction step is introduced as a novel heuristic rather than a tautological re-expression of the output accuracy.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

Abstract-only review limits visibility; the framework rests on domain assumptions about radar robustness and VLM interpretability of physical parameters, plus one newly introduced processing entity (PRCA) without shown independent validation.

free parameters (2)
  • weights in vector synthesis
    Used to combine radar reflections into dielectric constant estimate; value not specified in abstract.
  • uncertainty thresholds for fusion
    Control how optical and radar outputs are reconciled; not quantified.
axioms (2)
  • domain assumption Millimeter-wave radar provides material sensing independent of lighting and visual appearance.
    Invoked to justify the electromagnetic pipeline as a stable reference.
  • domain assumption Context-augmented generation allows a VLM to treat electromagnetic parameters as reliable material cues.
    Core premise enabling the radar knowledge injection step.
invented entities (1)
  • effective peak reflection cell area (PRCA) method no independent evidence
    purpose: Extract intrinsic dielectric constant from radar signals for VLM input.
    New radar processing technique introduced to bridge electromagnetic data to semantic reasoning.

pith-pipeline@v0.9.0 · 5583 in / 1577 out tokens · 87969 ms · 2026-05-10T14:54:59.766137+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Object detection in 20 years: A survey,

    Z. Zou, K. Chen, Z. Shi, Y . Guo, and J. Ye, “Object detection in 20 years: A survey,”Proceedings of the IEEE, vol. 111, no. 3, pp. 257–276, 2023

  2. [2]

    Vision-language models for vision tasks: A survey,

    J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 46, no. 8, pp. 5625–5644, 2024

  3. [3]

    Object detection with multimodal large vision-language models: An in-depth review,

    R. Sapkota and M. Karkee, “Object detection with multimodal large vision-language models: An in-depth review,”Available at SSRN 5233953, 2025

  4. [4]

    Intelligent recognition of composite material damage based on deep learning and infrared testing,

    C. Li, X. Wei, W. He, H. Guo, J. Zhong, X. Wu, and H. Xu, “Intelligent recognition of composite material damage based on deep learning and infrared testing,”Optics Express, vol. 29, no. 20, pp. 31 739–31 753, 2021

  5. [5]

    Lowlight object recognition by deep learning with passive three-dimensional integral imaging in visible and long wave infrared wavelengths,

    P. Wani, K. Usmani, G. Krishnan, T. O’Connor, and B. Javidi, “Lowlight object recognition by deep learning with passive three-dimensional integral imaging in visible and long wave infrared wavelengths,”Optics Express, vol. 30, no. 2, pp. 1205–1218, 2022

  6. [6]

    Akte- liquid: Acoustic-based liquid identification with smartphones,

    X. Sun, W. Deng, X. Wei, D. Fang, B. Li, and X. Chen, “Akte- liquid: Acoustic-based liquid identification with smartphones,”ACM Transactions on Sensor Networks, vol. 19, no. 1, pp. 1–24, 2023

  7. [7]

    Material recognition using robotic hand with capacitive tactile sensor array and machine learning,

    X. Liu, W. Yang, F. Meng, and T. Sun, “Material recognition using robotic hand with capacitive tactile sensor array and machine learning,” IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1–9, 2024

  8. [8]

    Object pose and surface material recognition using a single-time-of- flight camera,

    D. Yang, D. An, T. Xu, Y . Zhang, Q. Wang, Z. Pan, and Y . Yue, “Object pose and surface material recognition using a single-time-of- flight camera,”Advanced Photonics Nexus, vol. 3, no. 5, pp. 056 001– 056 001, 2024

  9. [9]

    N-imagenet: Towards robust, fine-grained object recognition with event cameras,

    J. Kim, J. Bae, G. Park, D. Zhang, and Y . M. Kim, “N-imagenet: Towards robust, fine-grained object recognition with event cameras,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 2146–2156

  10. [10]

    Beyond appearances: Material seg- mentation with embedded spectral information from rgb-d imagery,

    F. Perez and H. Rueda-Chac ´on, “Beyond appearances: Material seg- mentation with embedded spectral information from rgb-d imagery,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 293–301

  11. [11]

    Soda: A large-scale open site object detection dataset for deep learning in construction,

    R. Duan, H. Deng, M. Tian, Y . Deng, and J. Lin, “Soda: A large-scale open site object detection dataset for deep learning in construction,” Automation in Construction, vol. 142, p. 104499, 2022

  12. [12]

    Physically grounded vision-language models for robotic manipulation,

    J. Gao, B. Sarkar, F. Xia, T. Xiao, J. Wu, B. Ichter, A. Majumdar, and D. Sadigh, “Physically grounded vision-language models for robotic manipulation,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 12 462–12 469

  13. [13]

    Tagtag: Material sensing with commodity rfid,

    B. Xie, J. Xiong, X. Chen, E. Chai, L. Li, Z. Tang, and D. Fang, “Tagtag: Material sensing with commodity rfid,” inProceedings of the 17th conference on embedded networked sensor systems, 2019, pp. 338– 350

  14. [14]

    Wi-painter: Fine- grained material identification and image delineation using cots wifi devices,

    D. Yan, P. Yang, F. Shang, W. Jiang, and X.-Y . Li, “Wi-painter: Fine- grained material identification and image delineation using cots wifi devices,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 7, no. 4, pp. 1–25, 2024

  15. [15]

    Siwa: See into walls via deep uwb radar,

    T. Zheng, Z. Chen, J. Luo, L. Ke, C. Zhao, and Y . Yang, “Siwa: See into walls via deep uwb radar,” inProceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 323–336

  16. [16]

    Nlos identification and ranging trustworthiness for indoor positioning with llm-based uwb-imu fusion,

    H. Yang, Y . Wang, C. K. Seow, Z. Li, M. Sun, C. De Cock, J. Bi, W. Joseph, and D. Plets, “Nlos identification and ranging trustworthiness for indoor positioning with llm-based uwb-imu fusion,”IEEE Transac- tions on Instrumentation and Measurement, 2025

  17. [17]

    Material-id: Towards mmwave-based material identification,

    G. Chen, C. Luo, H. Zeng, G. Wen, Z. Luo, J. Wang, J. Zhang, Z. Yang, and J. Li, “Material-id: Towards mmwave-based material identification,” ACM Transactions on Sensor Networks, vol. 21, no. 4, pp. 1–26, 2025

  18. [18]

    Can large language models identify materials from radar signals?

    J. Zhu, H. Deng, and H. Chen, “Can large language models identify materials from radar signals?”arXiv preprint arXiv:2508.03120, 2025

  19. [19]

    Robotera: non-contact fric- tion sensing for robotic grasping via wireless sub-terahertz perception,

    V . Yazdnian, R. Shen, and Y . Ghasempour, “Robotera: non-contact fric- tion sensing for robotic grasping via wireless sub-terahertz perception,” inProceedings of the 23rd ACM Conference on Embedded Networked Sensor Systems, 2025, pp. 172–185

  20. [20]

    Radarcat: Radar categorization for input & interaction,

    H.-S. Yeo, G. Flamich, P. Schrempf, D. Harris-Birtill, and A. Quigley, “Radarcat: Radar categorization for input & interaction,” inProceed- ings of the 29th Annual Symposium on User Interface Software and Technology, 2016, pp. 833–841

  21. [21]

    msense: Towards mobile material sensing with a single millimeter-wave radio,

    C. Wu, F. Zhang, B. Wang, and K. R. Liu, “msense: Towards mobile material sensing with a single millimeter-wave radio,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 4, no. 3, pp. 1–20, 2020

  22. [22]

    Intelligent multi-modal sensing-communication integration: Synesthesia of machines,

    X. Cheng, H. Zhang, J. Zhang, S. Gao, S. Li, Z. Huang, L. Bai, Z. Yang, X. Zheng, and L. Yang, “Intelligent multi-modal sensing-communication integration: Synesthesia of machines,”IEEE Communications Surveys & Tutorials, vol. 26, no. 1, pp. 258–301, 2023

  23. [23]

    millieye: A lightweight mmwave radar and camera fusion system for robust object 15 detection,

    X. Shuai, Y . Shen, Y . Tang, S. Shi, L. Ji, and G. Xing, “millieye: A lightweight mmwave radar and camera fusion system for robust object 15 detection,” inProceedings of the International Conference on Internet- of-Things Design and Implementation, 2021, pp. 145–157

  24. [24]

    Fuselid: Vision-language model-based camera-radar fusion for liquid identification,

    H. Deng, J. Zhu, and H. Chen, “Fuselid: Vision-language model-based camera-radar fusion for liquid identification,” inCompanion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2025, pp. 1500–1504

  25. [25]

    Rodnet: A real-time radar object detection network cross-supervised by camera- radar fused object 3d localization,

    Y . Wang, Z. Jiang, Y . Li, J.-N. Hwang, G. Xing, and H. Liu, “Rodnet: A real-time radar object detection network cross-supervised by camera- radar fused object 3d localization,”IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 4, pp. 954–967, 2021

  26. [26]

    Craft: Camera-radar 3d object detection with spatio-contextual fusion transformer,

    Y . Kim, S. Kim, J. W. Choi, and D. Kum, “Craft: Camera-radar 3d object detection with spatio-contextual fusion transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 1160–1168

  27. [27]

    Crfusion: Fine-grained object identification using rf-image modality fusion,

    L. Xiao, Y . Yang, Z. Chen, G. Yue, P. Mohapatra, and P. Hu, “Crfusion: Fine-grained object identification using rf-image modality fusion,”IEEE Transactions on Mobile Computing, 2025

  28. [28]

    Black-box testing of deep neural networks through test case diversity,

    Z. Aghababaeyan, M. Abdellatif, L. Briand, M. Bagherzadehet al., “Black-box testing of deep neural networks through test case diversity,” IEEE Transactions on Software Engineering, vol. 49, no. 5, pp. 3182– 3204, 2023

  29. [29]

    Machine-learning methods for material identification using mmwave radar sensor,

    S. Skaria, N. Hendy, and A. Al-Hourani, “Machine-learning methods for material identification using mmwave radar sensor,”IEEE Sensors Journal, vol. 23, no. 2, pp. 1471–1478, 2022

  30. [30]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Y . Gao, Y . Xiong, X. Gao, K. Jia, J. Pan, Y . Bi, Y . Dai, J. Sun, H. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,”arXiv preprint arXiv:2312.10997, vol. 2, no. 1, 2023

  31. [31]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

  32. [32]

    E. O. Brigham,The fast Fourier transform and its applications. Prentice-Hall, Inc., 1988

  33. [33]

    D. K. Barton,Radar equations for modern radar. Artech House, 2013

  34. [34]

    Radar backscatter from conducting polyhedral spheres,

    P. A. Bernhardt, “Radar backscatter from conducting polyhedral spheres,”IEEE Antennas and Propagation Magazine, vol. 52, no. 5, pp. 52–70, 2011

  35. [35]

    Microwave remote sensing: Active and passive. volume 3-from theory to applications,

    F. T. Ulaby, R. K. Moore, and A. K. Fung, “Microwave remote sensing: Active and passive. volume 3-from theory to applications,” 1986

  36. [36]

    G. G. Raju,Dielectrics in electric fields: Tables, Atoms, and Molecules. CRC press, 2017

  37. [37]

    A Survey on Hallucination in Large Vision-Language Models

    H. Liu, W. Xue, Y . Chen, D. Chen, X. Zhao, K. Wang, L. Hou, R. Li, and W. Peng, “A survey on hallucination in large vision-language models,” arXiv preprint arXiv:2402.00253, 2024

  38. [38]

    Photon, poisson noise,

    S. W. Hasinoff, “Photon, poisson noise,” inComputer vision. Springer, 2014, pp. 608–610

  39. [39]

    Feature congestion: a measure of display clutter,

    R. Rosenholtz, Y . Li, J. Mansfield, and Z. Jin, “Feature congestion: a measure of display clutter,” inProceedings of the SIGCHI conference on Human factors in computing systems, 2005, pp. 761–770

  40. [40]

    What uncertainties do we need in bayesian deep learning for computer vision?

    A. Kendall and Y . Gal, “What uncertainties do we need in bayesian deep learning for computer vision?”Advances in neural information processing systems, vol. 30, 2017

  41. [41]

    M. A. Richardset al.,Fundamentals of radar signal processing. Mcgraw-hill New York, 2005, vol. 1

  42. [42]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  43. [43]

    Language Models (Mostly) Know What They Know

    S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnsonet al., “Language models (mostly) know what they know,”arXiv preprint arXiv:2207.05221, 2022

  44. [44]

    The boltzmann distribution,

    D. K. Russell, “The boltzmann distribution,”Journal of Chemical Education, vol. 73, no. 4, p. 299, 1996

  45. [45]

    R. M. Gray,Entropy and information theory. Springer Science & Business Media, 2011

  46. [46]

    Shoelace formula: Connecting the area of a polygon and the vector cross product,

    Y . Lee and W. Lim, “Shoelace formula: Connecting the area of a polygon and the vector cross product,”Mathematics teacher, vol. 110, no. 8, pp. 631–636, 2017

  47. [47]

    P. G. Huray,Maxwell’s equations. John Wiley & Sons, 2009

  48. [48]

    Optical phase measurement emphasized,

    E. Salik, “Optical phase measurement emphasized,” inEducation and Training in Optics and Photonics. Optica Publishing Group, 2009, p. EMB3

  49. [49]

    Effects of building materials and structures on radiowave propagation above about 100 mhz,

    P. Series, “Effects of building materials and structures on radiowave propagation above about 100 mhz,”recommendation itu-r, pp. 2040–1, 2015. 16 APPENDIXA CALCULATION OFPRCA. To quantify the coverage of the beam, we calculate the -3dB beam area asA r. The process involves identifying the main lobe region and integrating the area of its grid cells. First,...

  50. [50]

    Second, the total area is approximated by summing the areas of the discrete quadrilateral grid cells withinR

    We extract the connected component of grid points containing the peak where the amplitude exceedsP th, filtering out isolated sidelobes. Second, the total area is approximated by summing the areas of the discrete quadrilateral grid cells withinR. For a grid cell located at index(i, j), the Cartesian coordinates of its four vertices are defined asv 1 = (x ...