NF-TrackLLM: Joint Prediction of UAV Trajectory and Near-Field Beam for LAE XL-MIMO Systems

Jiachen Tian; Mengyuan Li; Qianfan Lu; Shi Jin; Xiao Li; Yu Han

arxiv: 2605.26928 · v1 · pith:C4UAEEBEnew · submitted 2026-05-26 · 📡 eess.SP

Qianfan Lu, Mengyuan Li, Jiachen Tian, Yu Han, Xiao Li, Shi Jin

Pith reviewed 2026-06-29 15:44 UTC · model grok-4.3

keywords UAV trajectory prediction · near-field beam prediction · XL-MIMO systems · multi-modal sensing · LLM for physical layer · low-altitude economy

The pith

NF-TrackLLM uses aligned multi-modal sensing data and a cascaded GPT-2 strategy to jointly predict UAV trajectories and near-field beams in XL-MIMO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework called NF-TrackLLM for joint prediction of unmanned aerial vehicle trajectories and near-field beams in extremely large MIMO systems operating in low-altitude economy scenarios. It integrates visual and LiDAR sensing with GPS through a channel generation pipeline to create representations that capture environmental semantics. These feed into a GPT-2-based model that first predicts future trajectories and then uses those as geometric priors for beam prediction. A sympathetic reader would care because this addresses the distance sensitivity and spatial coupling issues in near-field propagation that hinder traditional beam management. Simulations indicate it delivers accurate beam prediction and reliable tracking in dense urban settings.

Core claim

By building upon aligned multi-modal representations from sensing inputs, a GPT-2-based spatiotemporal reasoning backbone with a cascaded prediction strategy infers future trajectories first and then guides beam prediction, achieving accurate results in simulated dense urban low-altitude scenarios.

What carries the argument

The cascaded prediction strategy that first infers trajectories and then uses them as geometric priors for beam prediction, supported by aligned multi-modal representations.
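The trajectory-then-beam cascade can be illustrated with a minimal sketch. This is not the paper's GPT-2 model: stage 1 is replaced by a constant-velocity extrapolation, and stage 2 by a nearest-codeword lookup in a hypothetical polar (angle, distance) codebook. All function names, the codebook grid, and the 2-D geometry are assumptions made for illustration only.

```python
import numpy as np

def predict_trajectory(past_xy, horizon):
    """Stage 1 (stand-in for the learned predictor): constant-velocity extrapolation."""
    past_xy = np.asarray(past_xy, dtype=float)
    velocity = past_xy[-1] - past_xy[-2]            # displacement over the last step
    steps = np.arange(1, horizon + 1)[:, None]      # shape (horizon, 1)
    return past_xy[-1] + steps * velocity           # shape (horizon, 2)

def build_polar_codebook(angles_deg, distances_m):
    """Hypothetical near-field codebook: one codeword per (angle, distance) pair."""
    return np.array([(a, d) for a in angles_deg for d in distances_m], dtype=float)

def select_beams(future_xy, codebook):
    """Stage 2: use the predicted positions as a geometric prior for beam choice.

    Each codeword is mapped to its focal point (angle measured from the y axis,
    taken here as array broadside), and the closest focal point is selected.
    """
    angles = np.radians(codebook[:, 0])
    focal = np.stack(
        [codebook[:, 1] * np.sin(angles), codebook[:, 1] * np.cos(angles)], axis=1
    )
    sq_dist = ((future_xy[:, None, :] - focal[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)                   # one beam index per future step

# Usage: two past positions, predict two steps ahead, then pick beams.
future = predict_trajectory([(0.0, 10.0), (1.0, 10.0)], horizon=2)
codebook = build_polar_codebook([-30, 0, 30], [10, 20])
beams = select_beams(future, codebook)
print(future, beams)
```

The point of the sketch is the data flow: the beam stage never sees the raw history, only the output of the trajectory stage, so any trajectory error propagates directly into the beam choice.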

If this is right

Accurate beam prediction becomes possible despite near-field distance sensitivity.
Reliable UAV trajectory tracking is achieved in dense urban low-altitude scenarios.
Joint handling of localization and beam management improves overall system performance.
Environmental semantics from multi-modal data guide the predictions effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this approach to non-UAV mobile users could broaden its applicability in wireless networks.
Integrating real-time sensor data instead of simulated channels might enhance robustness in actual deployments.
If the cascaded strategy proves effective, it could reduce computational overhead compared to simultaneous joint prediction methods.

Load-bearing premise

The aligned multi-modal representation derived from visual, LiDAR sensing, GPS, and channel generation supplies sufficient environmental semantics to make the cascaded trajectory-then-beam prediction effective.

What would settle it

Observing in field trials that beam prediction accuracy or trajectory tracking reliability drops significantly below simulation levels when using actual sensor inputs and real propagation would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.26928 by Jiachen Tian, Mengyuan Li, Qianfan Lu, Shi Jin, Xiao Li, Yu Han.

**Figure 2.** Architecture of the NF-TrackLLM framework. Multi-modal inputs, including images, LiDAR, GPS, and prompts, …

**Figure 3.** MAE performance of different combinations of envi…

**Figure 4.** Top-K accuracy performance of different combinations of environmental semantics.

… (GRU) [6]. For performance evaluation, MAE is used for user localization, while Top-K accuracy is adopted for beam prediction. Top-1 accuracy indicates whether the optimal beam is correctly predicted, while Top-5 accuracy measures whether it is included among the five highest-ranked candidates. To investigate the impact of dif…

**Figure 5.** MAE performance of different methods

**Figure 6.** Top-K accuracy performance of different methods

**Figure 7.** MAE performance of different Tprev

**Figure 8.** Top-K accuracy performance of different Tprev.

… [in]creases from 5 to 15, all methods benefit from richer temporal context, resulting in lower MAE and higher beam accuracy. Meanwhile, NF-TrackLLM achieves the best performance across all window lengths, showing its robustness.

V. CONCLUSION. In this paper, we presented a novel NF-TrackLLM framework for near-field beam prediction and UAV positioning in XL-MIMO sy…
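The text accompanying Figure 4 evaluates localization with MAE and beam prediction with Top-K accuracy. A minimal sketch of both metrics follows. The function names are hypothetical, and the MAE here averages absolute per-coordinate errors, which is an assumption: the excerpt does not say whether the paper uses per-coordinate or Euclidean error.

```python
import numpy as np

def mean_absolute_error(pred_pos, true_pos):
    """Mean absolute per-coordinate error between predicted and true positions."""
    pred = np.asarray(pred_pos, dtype=float)
    true = np.asarray(true_pos, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def top_k_accuracy(scores, best_beam, k):
    """Fraction of samples whose optimal beam is among the k highest-scored beams.

    scores:    (num_samples, num_beams) predicted score per candidate beam
    best_beam: (num_samples,) index of the true optimal beam
    """
    scores = np.asarray(scores, dtype=float)
    best_beam = np.asarray(best_beam)
    top_k = np.argsort(-scores, axis=1)[:, :k]      # indices of the k largest scores
    hits = (top_k == best_beam[:, None]).any(axis=1)
    return float(hits.mean())

# Usage with toy values (not results from the paper).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
best = [1, 1]
print(top_k_accuracy(scores, best, k=1), top_k_accuracy(scores, best, k=2))
print(mean_absolute_error([[0, 0], [2, 2]], [[1, 0], [2, 4]]))
```

Top-1 is the special case k = 1, and Top-5 as described in the excerpt corresponds to k = 5.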

Original abstract

User localization and beam management are tightly linked in extremely large-scale multiple-input multiple-output (XL-MIMO) systems, especially in dense low-altitude economy (LAE) scenarios. However, the near-field propagation in XL-MIMO introduces strong distance sensitivity and complex spatial coupling, which makes joint trajectory and beam prediction challenging. Meanwhile, large language models (LLMs) have attracted attention in physical-layer transmission for modeling long-range dependencies. In this paper, we propose NF-TrackLLM, a multi-modal semantic-aware framework for near-field unmanned aerial vehicle (UAV) positioning and beam prediction in XL-MIMO systems. By incorporating visual and LiDAR sensing into a Sionna-based channel generation pipeline, environmental semantics and GPS are utilized to guide trajectory and beam prediction. Built upon the aligned multi-modal representation, a GPT-2-based spatiotemporal reasoning backbone and a cascaded prediction strategy are employed, where future trajectories are first inferred and then used to guide beam prediction as geometric priors. Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking in dense urban low-altitude scenarios.
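The distance sensitivity mentioned in the abstract can be seen in a small numerical sketch. It uses the standard spherical-wave array response for a uniform linear array and compares the normalized gain of a beam focused at the correct distance with one focused at the wrong distance along the same direction. The array size, carrier wavelength, and distances are assumptions chosen for illustration, not the paper's configuration; the near-field boundary is taken as the usual Rayleigh distance 2D²/λ.

```python
import numpy as np

def near_field_response(num_antennas, spacing_m, wavelength_m, angle_rad, distance_m):
    """Spherical-wave response of a uniform linear array placed along the x axis.

    The user sits at polar coordinates (distance, angle), with the angle measured
    from array broadside (the y axis). Phases follow the exact per-element distance.
    """
    n = np.arange(num_antennas) - (num_antennas - 1) / 2
    x_n = n * spacing_m                                   # element positions
    ux = distance_m * np.sin(angle_rad)
    uy = distance_m * np.cos(angle_rad)
    r_n = np.hypot(ux - x_n, uy)                          # element-to-user distances
    return np.exp(-2j * np.pi * r_n / wavelength_m) / np.sqrt(num_antennas)

def beam_gain(beam, channel):
    """Normalized gain |beam^H channel|^2 (equals 1 for a perfectly matched beam)."""
    return float(np.abs(np.vdot(beam, channel)) ** 2)

# Assumed setup: 256 elements, half-wavelength spacing, 1 cm wavelength.
N, lam = 256, 0.01
d = lam / 2
aperture = (N - 1) * d
rayleigh_m = 2 * aperture**2 / lam                        # near-field boundary

user = near_field_response(N, d, lam, 0.0, 10.0)          # user 10 m away, broadside
matched = beam_gain(near_field_response(N, d, lam, 0.0, 10.0), user)
mismatched = beam_gain(near_field_response(N, d, lam, 0.0, 20.0), user)
print(rayleigh_m, matched, mismatched)
```

Well inside the Rayleigh distance, a beam with the right angle but the wrong focal distance loses most of its gain, which is why a position estimate (angle and distance) is a useful prior for near-field beam selection.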

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The abstract sketches an LLM-based joint predictor for UAV trajectories and near-field beams but supplies zero metrics or baselines, so the central claim cannot be checked.

The main takeaway is that NF-TrackLLM tries to link multi-modal sensing to cascaded trajectory-then-beam prediction in near-field XL-MIMO for low-altitude UAVs, yet the abstract alone gives no numbers, so we cannot tell whether the simulations actually back the accuracy claims.

What is new is the concrete pipeline that feeds visual and LiDAR data plus GPS into a Sionna channel generator, aligns the representations, and then runs a GPT-2 spatiotemporal backbone with the trajectory output used as a geometric prior for beam prediction. That combination for dense urban LAE scenarios has not appeared in the prior LLM-for-wireless abstracts I have seen.

The paper does a clear job stating the problem: near-field distance sensitivity and spatial coupling make separate localization and beam management inefficient, and the high-level architecture shows how environmental semantics could supply useful priors.

The soft spots are straightforward. No quantitative results appear: no RMSE values, no comparison against Kalman filters or simpler regression baselines, no ablation on modality alignment or on the cascaded versus end-to-end strategy, and no description of the training set or Sionna configuration. Without those, the claim that the method achieves “accurate beam prediction and reliable UAV trajectory tracking” stays untestable. The assumption that the aligned multi-modal input carries enough semantics to make the cascade effective is reasonable on paper but remains unexamined.

This work would interest people already working on integrated sensing and communication or on applying sequence models to physical-layer tasks in UAV networks. A reader looking for a concrete example of how to wire sensor data into an LLM backbone could extract the structure, but anyone needing reproducible evidence would have to wait for the full experiments.

Because the full manuscript was not accessible, I cannot verify the implementation details or check for hidden fitting. If the complete version contains proper metrics, ablations, and baseline comparisons, it would merit peer review; on the current abstract it does not yet clear that bar.

Referee Report

1 major / 0 minor

Summary. The paper proposes NF-TrackLLM, a multi-modal semantic-aware framework for joint UAV trajectory and near-field beam prediction in XL-MIMO systems for LAE scenarios. It incorporates visual and LiDAR sensing into a Sionna-based channel generation pipeline along with GPS data to create an aligned multi-modal representation. A GPT-2-based spatiotemporal reasoning backbone is used with a cascaded prediction strategy in which future trajectories are first inferred and then employed as geometric priors to guide beam prediction. The authors state that simulation results demonstrate accurate beam prediction and reliable UAV trajectory tracking in dense urban low-altitude scenarios.

Significance. If the simulation results hold with proper validation, the work could contribute to integrating LLMs and multi-modal environmental sensing for physical-layer tasks in near-field XL-MIMO, particularly for UAV beam management and localization where distance sensitivity and spatial coupling are pronounced. The cascaded trajectory-then-beam approach is a logical way to inject geometric priors. However, the complete absence of any quantitative metrics, baselines, ablations, or implementation details prevents assessment of whether the multi-modal alignment actually supplies sufficient semantics.

major comments (1)

[Abstract] The central claim that 'Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking' is unsupported by any quantitative metrics, baselines, dataset details, error bars, or ablation results. This is load-bearing because the soundness of the performance claims cannot be evaluated from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for quantitative support behind our performance claims. We agree that the current abstract does not provide sufficient evidence and will revise the manuscript accordingly to enable proper evaluation.

Referee: [Abstract] The central claim that 'Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking' is unsupported by any quantitative metrics, baselines, dataset details, error bars, or ablation results. This is load-bearing because the soundness of the performance claims cannot be evaluated from the provided text.

Authors: We agree that the abstract's claim requires explicit quantitative backing. In the revised version we will (1) replace the generic statement with concrete metrics such as average trajectory RMSE (in meters) and beamforming gain loss (in dB) obtained from the Sionna-generated LAE scenarios, (2) report results against at least two baselines (e.g., Kalman-filter trajectory prediction followed by conventional near-field beamforming, and a non-cascaded GPT-2 variant), (3) describe the dataset size, urban map parameters, and number of Monte-Carlo runs with error bars, and (4) add a dedicated ablation subsection quantifying the contribution of the visual/LiDAR modalities and the cascaded trajectory-to-beam prior. These additions will be placed both in an expanded abstract and in a new Results section.

Revision: yes

Circularity Check

0 steps flagged

No circularity identified; derivation chain not reducible to inputs

The abstract and available description outline a multi-modal framework (visual/LiDAR/GPS + Sionna channel model) feeding a GPT-2 backbone for cascaded trajectory-then-beam prediction, with simulation results claimed as validation. No equations, parameter-fitting steps, self-citations, or uniqueness theorems are present in the provided text. No load-bearing claim reduces by construction to a fitted input or self-referential definition. The central claim rests on external simulation outcomes rather than internal redefinition, satisfying the criteria for a self-contained non-circular presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.


Reference graph

Works this paper leans on

16 extracted references · 4 canonical work pages · 1 internal anchor

[1] J. Tian et al., “Pioneering scalable prototype for mid-band XL-MIMO systems: Design and implementation,” IEEE J. Sel. Areas Commun., vol. 44, pp. 3365–3381, Jan. 2026.

[2] 3GPP, “Study on artificial intelligence (AI) / machine learning (ML) for NR air interface,” 3rd Generation Partnership Project (3GPP), Technical Report (TR) 38.843, 2024.

[3] Y. Han et al., “Toward extra large-scale MIMO: New channel properties and low-cost designs,” IEEE Internet Things J., vol. 10, no. 16, pp. 14569–14594, Aug. 2023.

[4] M. Cui and L. Dai, “Channel estimation for extremely large-scale MIMO: Far-field or near-field?” IEEE Trans. Commun., vol. 70, no. 4, pp. 2663–2677, Apr. 2022.

[5] G. Charan et al., “Vision-position multi-modal beam prediction using real millimeter wave datasets,” in Proc. IEEE WCNC, Apr. 2022, pp. 2727–2731.

[6] S. Jiang, G. Charan, and A. Alkhateeb, “Lidar aided future beam prediction in real-world millimeter wave V2I communications,” IEEE Wireless Commun. Lett., vol. 12, no. 2, pp. 212–216, Feb. 2022.

[7] Y. Zhao et al., “Multi-modal large models based beam prediction: An example empowered by DeepSeek,” arXiv preprint arXiv:2506.05921, 2025.

[8] C. Zheng et al., “M2BeamLLM: Multimodal sensing-empowered mmwave beam prediction with large language models,” arXiv preprint arXiv:2506.14532, 2025.

[9] B. Liu et al., “WiFo: Wireless foundation model for channel prediction,” Sci. China Inf. Sci., vol. 68, no. 6, p. 162302, May 2025.

[10] M. Li et al., “Multimodal-NF: A wireless dataset for near-field low-altitude sensing and communications,” arXiv preprint arXiv:2603.28280, 2026.

[11] A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.

[12] K. He et al., “Deep residual learning for image recognition,” in Proc. IEEE CVPR, Jun. 2016, pp. 770–778.

[13] C. R. Qi et al., “Pointnet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE CVPR, Jul. 2017, pp. 652–660.

[14] S. Dörner, J. Hoydis, and S. ten Brink, “Sionna: An open-source library for next-generation physical layer research,” arXiv preprint arXiv:2203.11854, 2022.

[15] S. Khunteta and A. K. R. Chavva, “Recurrent neural network based beam prediction for millimeter-wave 5G systems,” in Proc. IEEE WCNC, Mar. 2021, pp. 1–6.

[16] S. H. A. Shah and S. Rangan, “Multi-cell multi-beam prediction using auto-encoder LSTM for mmwave systems,” IEEE Trans. Wireless Commun., vol. 21, no. 12, pp. 10366–10380, Dec. 2022.
