NF-TrackLLM: Joint Prediction of UAV Trajectory and Near-Field Beam for LAE XL-MIMO Systems
Pith reviewed 2026-06-29 15:44 UTC · model grok-4.3
The pith
NF-TrackLLM uses aligned multi-modal sensing data and a cascaded GPT-2 strategy to jointly predict UAV trajectories and near-field beams in XL-MIMO.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a GPT-2-based spatiotemporal reasoning backbone, built on aligned multi-modal representations of the sensing inputs, can run a cascaded prediction strategy: it infers future trajectories first and then uses them to guide beam prediction. The reported outcome is accurate beam prediction and reliable trajectory tracking in simulated dense urban low-altitude scenarios.
What carries the argument
The cascaded prediction strategy that first infers trajectories and then uses them as geometric priors for beam prediction, supported by aligned multi-modal representations.
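The trajectory-then-beam cascade can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the constant-velocity extrapolation stands in for the GPT-2 backbone, the array is a 64-element half-wavelength uniform linear array on the x-axis, the wavelength is 0.01 m, and the codebook is a hypothetical list of focal points. The sketch only shows how a predicted position can act as a geometric prior when choosing a near-field beam.

```python
import cmath
import math

WAVELENGTH = 0.01          # metres (assumed carrier, not from the paper)
NUM_ANTENNAS = 64          # assumed ULA size
SPACING = WAVELENGTH / 2   # assumed half-wavelength element spacing


def predict_next_position(track):
    """Stage 1 stand-in: constant-velocity extrapolation of the last two fixes."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def focusing_vector(point):
    """Unit-norm spherical-wave response of a ULA centred at the origin on the x-axis."""
    px, py = point
    weights = []
    for n in range(NUM_ANTENNAS):
        ex = (n - (NUM_ANTENNAS - 1) / 2) * SPACING    # element x-coordinate
        dist = math.hypot(px - ex, py)                  # exact element-to-point range
        phase = cmath.exp(-2j * math.pi * dist / WAVELENGTH)
        weights.append(phase / math.sqrt(NUM_ANTENNAS))
    return weights


def beam_gain(codeword, channel):
    """|w^H h|^2 for unit-norm vectors; 1.0 means perfect focusing."""
    return abs(sum(w.conjugate() * h for w, h in zip(codeword, channel))) ** 2


def select_beam(codebook_points, predicted_point):
    """Stage 2: pick the codeword whose focal point best matches the predicted position."""
    target = focusing_vector(predicted_point)
    gains = [beam_gain(focusing_vector(p), target) for p in codebook_points]
    return max(range(len(gains)), key=gains.__getitem__)


if __name__ == "__main__":
    track = [(0.0, 10.0), (1.0, 10.0)]                  # past (x, y) fixes in metres
    codebook = [(-2.0, 10.0), (0.0, 10.0), (2.0, 10.0), (2.0, 30.0)]
    predicted = predict_next_position(track)
    print(predicted, select_beam(codebook, predicted))
```

In this toy setup the beam is chosen from geometry alone, so any trajectory error propagates directly into the beam choice. That dependence is the reason the cascade is only as good as its first stage.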
If this is right
- Accurate beam prediction becomes possible despite near-field distance sensitivity.
- Reliable UAV trajectory tracking is achieved in dense urban low-altitude scenarios.
- Joint handling of localization and beam management improves overall system performance.
- Environmental semantics from multi-modal data guide the predictions effectively.
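The distance sensitivity named in the first point above follows from the standard spherical-wave array model, written out here for a uniform linear array. The notation is an assumption for illustration and is not drawn from the paper: N elements with spacing d, aperture D, wavelength \lambda, element offset index \delta_n measured from the array centre, and a source at range r and angle \theta from broadside.

```latex
% Rayleigh distance: the near-field region is roughly r < d_R
d_{\mathrm{R}} = \frac{2 D^{2}}{\lambda}

% Spherical-wave (near-field) response of element n
[\mathbf{a}(\theta, r)]_n = \frac{1}{\sqrt{N}}\, e^{-j \frac{2\pi}{\lambda} (r_n - r)},
\qquad
r_n = \sqrt{r^{2} + \delta_n^{2} d^{2} - 2 r\, \delta_n d \sin\theta}

% Second-order (Fresnel) expansion of the path difference
r_n - r \approx -\delta_n d \sin\theta + \frac{\delta_n^{2} d^{2} \cos^{2}\theta}{2 r}
```

The first term of the expansion is the familiar far-field steering phase, which depends only on the angle. The second term depends on r and vanishes as r grows, so a near-field beam must match both angle and range, which is why range errors cost beamforming gain.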
Where Pith is reading between the lines
- Extending this approach to non-UAV mobile users could broaden its applicability in wireless networks.
- Integrating real-time sensor data instead of simulated channels might enhance robustness in actual deployments.
- If the cascaded strategy proves effective, it could reduce computational overhead compared to simultaneous joint prediction methods.
Load-bearing premise
The aligned multi-modal representation derived from visual, LiDAR sensing, GPS, and channel generation supplies sufficient environmental semantics to make the cascaded trajectory-then-beam prediction effective.
What would settle it
Field trials with real sensor inputs and real propagation would settle it: if beam prediction accuracy or trajectory tracking reliability fell significantly below the simulated levels, the central claim would be falsified.
Original abstract
User localization and beam management are tightly linked in extremely large-scale multiple-input multiple-output (XL-MIMO) systems, especially in dense low-altitude economy (LAE) scenarios. However, the near-field propagation in XL-MIMO introduces strong distance sensitivity and complex spatial coupling, which makes joint trajectory and beam prediction challenging. Meanwhile, large language models (LLMs) have attracted attention in physical-layer transmission for modeling long-range dependencies. In this paper, we propose NF-TrackLLM, a multi-modal semantic-aware framework for near-field unmanned aerial vehicle (UAV) positioning and beam prediction in XL-MIMO systems. By incorporating visual and LiDAR sensing into a Sionna-based channel generation pipeline, environmental semantics and GPS are utilized to guide trajectory and beam prediction. Built upon the aligned multi-modal representation, a GPT-2-based spatiotemporal reasoning backbone and a cascaded prediction strategy are employed, where future trajectories are first inferred and then used to guide beam prediction as geometric priors. Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking in dense urban low-altitude scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NF-TrackLLM, a multi-modal semantic-aware framework for joint UAV trajectory and near-field beam prediction in XL-MIMO systems for LAE scenarios. It incorporates visual and LiDAR sensing into a Sionna-based channel generation pipeline along with GPS data to create an aligned multi-modal representation. A GPT-2-based spatiotemporal reasoning backbone is used with a cascaded prediction strategy in which future trajectories are first inferred and then employed as geometric priors to guide beam prediction. The authors state that simulation results demonstrate accurate beam prediction and reliable UAV trajectory tracking in dense urban low-altitude scenarios.
Significance. If the simulation results hold with proper validation, the work could contribute to integrating LLMs and multi-modal environmental sensing for physical-layer tasks in near-field XL-MIMO, particularly for UAV beam management and localization where distance sensitivity and spatial coupling are pronounced. The cascaded trajectory-then-beam approach is a logical way to inject geometric priors. However, the complete absence of any quantitative metrics, baselines, ablations, or implementation details prevents assessment of whether the multi-modal alignment actually supplies sufficient semantics.
major comments (1)
- [Abstract] Abstract: The central claim that 'Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking' is unsupported by any quantitative metrics, baselines, dataset details, error bars, or ablation results. This is load-bearing because the soundness of the performance claims cannot be evaluated from the provided text.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for quantitative support behind our performance claims. We agree that the current abstract does not provide sufficient evidence and will revise the manuscript accordingly to enable proper evaluation.
- Referee: [Abstract] Abstract: The central claim that 'Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking' is unsupported by any quantitative metrics, baselines, dataset details, error bars, or ablation results. This is load-bearing because the soundness of the performance claims cannot be evaluated from the provided text.
Authors: We agree that the abstract's claim requires explicit quantitative backing. In the revised version we will (1) replace the generic statement with concrete metrics such as average trajectory RMSE (in meters) and beamforming gain loss (in dB) obtained from the Sionna-generated LAE scenarios, (2) report results against at least two baselines (e.g., Kalman-filter trajectory prediction followed by conventional near-field beamforming, and a non-cascaded GPT-2 variant), (3) describe the dataset size, urban map parameters, and number of Monte-Carlo runs with error bars, and (4) add a dedicated ablation subsection quantifying the contribution of the visual/LiDAR modalities and the cascaded trajectory-to-beam prior. These additions will be placed both in an expanded abstract and in a new Results section.
Revision: yes
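The metrics named in the rebuttal can be made concrete with a short sketch. The function names and the exact definitions are assumptions for illustration, since the source gives no formulas: RMSE is taken over per-step Euclidean position errors, gain loss is the dB ratio between the best achievable beam gain and the gain of the predicted beam, and error bars are the standard error across Monte-Carlo runs.

```python
import math


def trajectory_rmse(predicted, actual):
    """Root-mean-square Euclidean position error (metres) over a track."""
    squared = [
        sum((p - a) ** 2 for p, a in zip(pred_pt, true_pt))
        for pred_pt, true_pt in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))


def gain_loss_db(achieved_gain, optimal_gain):
    """Beamforming gain loss in dB relative to the best beam (0 dB means no loss)."""
    return 10 * math.log10(optimal_gain / achieved_gain)


def mean_and_stderr(samples):
    """Mean and standard error across Monte-Carlo runs, usable as error bars."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / (n - 1)  # unbiased estimate
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]   # hypothetical 3D fixes
    actual = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(trajectory_rmse(predicted, actual))
    print(gain_loss_db(achieved_gain=0.5, optimal_gain=1.0))
    print(mean_and_stderr([1.0, 2.0, 3.0]))
```

Reporting each metric as a mean with its standard error over runs would let a reader judge whether a gap between NF-TrackLLM and a baseline exceeds run-to-run variation.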
Circularity Check
No circularity identified; derivation chain not reducible to inputs
The abstract and available description outline a multi-modal framework (visual/LiDAR/GPS + Sionna channel model) feeding a GPT-2 backbone for cascaded trajectory-then-beam prediction, with simulation results claimed as validation. No equations, parameter-fitting steps, self-citations, or uniqueness theorems are present in the provided text. No load-bearing claim reduces by construction to a fitted input or self-referential definition. The central claim rests on external simulation outcomes rather than internal redefinition, satisfying the criteria for a self-contained non-circular presentation.