
arxiv: 2605.26928 · v1 · pith:C4UAEEBEnew · submitted 2026-05-26 · 📡 eess.SP

NF-TrackLLM: Joint Prediction of UAV Trajectory and Near-Field Beam for LAE XL-MIMO Systems

Pith reviewed 2026-06-29 15:44 UTC · model grok-4.3

classification 📡 eess.SP
keywords UAV trajectory prediction · near-field beam prediction · XL-MIMO systems · multi-modal sensing · LLM for physical layer · low-altitude economy

The pith

NF-TrackLLM uses aligned multi-modal sensing data and a cascaded GPT-2 strategy to jointly predict UAV trajectories and near-field beams in XL-MIMO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes NF-TrackLLM, a framework for jointly predicting unmanned aerial vehicle (UAV) trajectories and near-field beams in extremely large-scale MIMO (XL-MIMO) systems for low-altitude economy (LAE) scenarios. It feeds visual and LiDAR sensing into a Sionna-based channel generation pipeline and combines the resulting environmental semantics with GPS to build an aligned multi-modal representation. A GPT-2-based model then predicts future trajectories and uses them as geometric priors for beam prediction. This matters because near-field propagation introduces strong distance sensitivity and complex spatial coupling, which hinder traditional beam management. The reported simulations indicate accurate beam prediction and reliable tracking in dense urban settings.
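
The distance sensitivity mentioned above can be made concrete with the standard spherical-wave model of a uniform linear array. This is a textbook formulation and not necessarily the paper's own system model; the symbols N (number of antennas), d (element spacing), D (array aperture), λ (wavelength), r (user distance from the array center), θ (angle from broadside), and δ_n (signed index offset of element n from the array center) are assumptions introduced here for illustration.

```latex
% Boundary of the radiating near-field region (Rayleigh distance) for aperture D
\[ r_{\mathrm{R}} = \frac{2D^{2}}{\lambda} \]

% Exact distance from element n, located at offset \delta_n d along the array, to the user
\[ r_n = \sqrt{r^{2} + \delta_n^{2} d^{2} - 2\, r\, \delta_n d \sin\theta} \]

% Second-order (Fresnel) approximation of the same distance
\[ r_n \approx r - \delta_n d \sin\theta + \frac{\delta_n^{2} d^{2} \cos^{2}\theta}{2r} \]

% Near-field steering vector entry for element n
\[ [\mathbf{a}(\theta, r)]_n = \frac{1}{\sqrt{N}}\, e^{-j \frac{2\pi}{\lambda} (r_n - r)} \]
```

The last term of the approximation depends on r, so two users at the same angle but different ranges need different beams. As r grows far beyond the Rayleigh distance that term vanishes and the beam depends on θ alone, which is the usual far-field case.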

Core claim

Built on aligned multi-modal representations of the sensing inputs, a GPT-2-based spatiotemporal reasoning backbone with a cascaded prediction strategy first infers future trajectories and then uses them to guide beam prediction. The authors report that this yields accurate beam prediction and reliable trajectory tracking in simulated dense urban low-altitude scenarios.

What carries the argument

The cascaded prediction strategy that first infers trajectories and then uses them as geometric priors for beam prediction, supported by aligned multi-modal representations.
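
A toy sketch may help show what "trajectory first, beam second" means in practice. Everything in it is an assumption made for illustration: the constant-velocity extrapolator is a stand-in for the paper's GPT-2 backbone, the array is a half-wavelength uniform linear array at a hypothetical 6.5 GHz carrier, the codebook is built from a handful of sampled 3D points, and all function names are invented here.

```python
import numpy as np

C = 3e8  # speed of light in m/s


def near_field_steering(pos, n_ant=256, fc=6.5e9):
    """Spherical-wave steering vector of a half-wavelength ULA on the x-axis.

    n_ant and fc are hypothetical values, not taken from the paper.
    """
    lam = C / fc
    x = (np.arange(n_ant) - (n_ant - 1) / 2) * lam / 2
    elems = np.stack([x, np.zeros(n_ant), np.zeros(n_ant)], axis=1)
    dist = np.linalg.norm(elems - np.asarray(pos, dtype=float), axis=1)
    return np.exp(-2j * np.pi * dist / lam) / np.sqrt(n_ant)


def predict_trajectory(history, horizon):
    """Stage 1 stand-in: constant-velocity extrapolation (not the GPT-2 model)."""
    history = np.asarray(history, dtype=float)
    vel = history[-1] - history[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return history[-1] + steps * vel


def predict_beams(future_pos, codebook_points, top_k=5):
    """Stage 2: rank codewords by beamforming gain toward each predicted position."""
    codebook = np.stack([near_field_steering(p) for p in codebook_points])
    ranked = []
    for p in future_pos:
        h = near_field_steering(p)  # geometric prior from stage 1
        gain = np.abs(codebook.conj() @ h) ** 2
        ranked.append(np.argsort(gain)[::-1][:top_k])
    return np.array(ranked)
```

Calling `predict_trajectory` on a short position history and passing its output to `predict_beams` returns, for each future step, the indices of the codewords best matched to the predicted position. The point of the cascade is that stage 2 only searches beams consistent with the geometry produced by stage 1.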

If this is right

  • Accurate beam prediction becomes possible despite near-field distance sensitivity.
  • Reliable UAV trajectory tracking is achieved in dense urban low-altitude scenarios.
  • Joint handling of localization and beam management improves overall system performance.
  • Environmental semantics from multi-modal data guide the predictions effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this approach to non-UAV mobile users could broaden its applicability in wireless networks.
  • Integrating real-time sensor data instead of simulated channels might enhance robustness in actual deployments.
  • If the cascaded strategy proves effective, it could reduce computational overhead compared to simultaneous joint prediction methods.

Load-bearing premise

The aligned multi-modal representation derived from visual, LiDAR sensing, GPS, and channel generation supplies sufficient environmental semantics to make the cascaded trajectory-then-beam prediction effective.

What would settle it

Observing in field trials that beam prediction accuracy or trajectory tracking reliability drops significantly below simulation levels when using actual sensor inputs and real propagation would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.26928 by Jiachen Tian, Mengyuan Li, Qianfan Lu, Shi Jin, Xiao Li, Yu Han.

Figure 1: Illustration of the XL-MIMO system model: The BS
Figure 2: Architecture of the NF-TrackLLM framework. Multi-modal inputs, including images, LiDAR, GPS, and prompts,
Figure 3: MAE performance of different combinations of environmental semantics.
Figure 4: Top-K accuracy performance of different combinations of environmental semantics.
  Surrounding text: (GRU) [6]. For performance evaluation, MAE is used for user localization, while Top-K accuracy is adopted for beam prediction. Top-1 accuracy indicates whether the optimal beam is correctly predicted, while Top-5 accuracy measures whether it is included among the five highest-ranked candidates. To investigate the impact of dif…
Figure 5: MAE performance of different methods.
Figure 6: Top-K accuracy performance of different methods.
Figure 7: MAE performance of different Tprev.
Figure 8: Top-K accuracy performance of different Tprev.
  Surrounding text: creases from 5 to 15, all methods benefit from richer temporal context, resulting in lower MAE and higher beam accuracy. Meanwhile, NF-TrackLLM achieves the best performance across all window lengths, showing its robustness. V. CONCLUSION In this paper, we presented a novel NF-TrackLLM framework for near-field beam prediction and UAV positioning in XL-MIMO sy…
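
The two evaluation metrics named in the Figure 4 text, MAE for localization and Top-K accuracy for beam prediction, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are invented here, and averaging the absolute error over all samples and coordinates is an assumed convention, since the page does not state how the paper aggregates it.

```python
import numpy as np


def localization_mae(pred_pos, true_pos):
    """Mean absolute error over all samples and coordinates (assumed convention)."""
    pred_pos = np.asarray(pred_pos, dtype=float)
    true_pos = np.asarray(true_pos, dtype=float)
    return float(np.mean(np.abs(pred_pos - true_pos)))


def top_k_accuracy(scores, best_beam, k):
    """Fraction of samples whose optimal beam is among the k highest-scored beams.

    scores: (num_samples, num_beams) predicted score per beam.
    best_beam: (num_samples,) index of the true optimal beam.
    """
    scores = np.asarray(scores, dtype=float)
    best_beam = np.asarray(best_beam)
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    hits = np.any(top_k == best_beam[:, None], axis=1)
    return float(np.mean(hits))
```

With k=1 this reduces to checking whether the highest-scored beam is the optimal one, and with k=5 it matches the Top-5 definition quoted above.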
Original abstract

User localization and beam management are tightly linked in extremely large-scale multiple-input multiple-output (XL-MIMO) systems, especially in dense low-altitude economy (LAE) scenarios. However, the near-field propagation in XL-MIMO introduces strong distance sensitivity and complex spatial coupling, which makes joint trajectory and beam prediction challenging. Meanwhile, large language models (LLMs) have attracted attention in physical-layer transmission for modeling long-range dependencies. In this paper, we propose NF-TrackLLM, a multi-modal semantic-aware framework for near-field unmanned aerial vehicles (UAVs) positioning and beam prediction in XL-MIMO systems. By incorporating visual and LiDAR sensing into a Sionna-based channel generation pipeline, environmental semantics and GPS are utilized to guide trajectory and beam prediction. Built upon the aligned multi-modal representation, a GPT-2-based spatiotemporal reasoning backbone, and a cascaded prediction strategy are employed, where future trajectories are first inferred and then used to guide beam prediction as geometric priors. Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking in dense urban low-altitude scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes NF-TrackLLM, a multi-modal semantic-aware framework for joint UAV trajectory and near-field beam prediction in XL-MIMO systems for LAE scenarios. It incorporates visual and LiDAR sensing into a Sionna-based channel generation pipeline along with GPS data to create an aligned multi-modal representation. A GPT-2-based spatiotemporal reasoning backbone is used with a cascaded prediction strategy in which future trajectories are first inferred and then employed as geometric priors to guide beam prediction. The authors state that simulation results demonstrate accurate beam prediction and reliable UAV trajectory tracking in dense urban low-altitude scenarios.

Significance. If the simulation results hold with proper validation, the work could contribute to integrating LLMs and multi-modal environmental sensing for physical-layer tasks in near-field XL-MIMO, particularly for UAV beam management and localization where distance sensitivity and spatial coupling are pronounced. The cascaded trajectory-then-beam approach is a logical way to inject geometric priors. However, the absence of quantitative metrics, baselines, ablations, or implementation details in the provided text prevents assessment of whether the multi-modal alignment actually supplies sufficient semantics.

major comments (1)
  1. [Abstract] The central claim that 'Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking' is unsupported by any quantitative metrics, baselines, dataset details, error bars, or ablation results. This is load-bearing because the soundness of the performance claims cannot be evaluated from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for quantitative support behind our performance claims. We agree that the current abstract does not provide sufficient evidence and will revise the manuscript accordingly to enable proper evaluation.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'Simulation results demonstrate that NF-TrackLLM achieves accurate beam prediction and reliable UAV trajectory tracking' is unsupported by any quantitative metrics, baselines, dataset details, error bars, or ablation results. This is load-bearing because the soundness of the performance claims cannot be evaluated from the provided text.

    Authors: We agree that the abstract's claim requires explicit quantitative backing. In the revised version we will (1) replace the generic statement with concrete metrics such as average trajectory RMSE (in meters) and beamforming gain loss (in dB) obtained from the Sionna-generated LAE scenarios, (2) report results against at least two baselines (e.g., Kalman-filter trajectory prediction followed by conventional near-field beamforming, and a non-cascaded GPT-2 variant), (3) describe the dataset size, urban map parameters, and number of Monte-Carlo runs with error bars, and (4) add a dedicated ablation subsection quantifying the contribution of the visual/LiDAR modalities and the cascaded trajectory-to-beam prior. These additions will be placed both in an expanded abstract and in a new Results section.

    Revision: yes
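
The two metrics the simulated rebuttal promises, trajectory RMSE in meters and beamforming gain loss in dB, could be computed along the following lines. This is a sketch under stated assumptions rather than anything taken from the paper: the function names are invented here, RMSE is taken over the per-sample Euclidean position error, and gain loss is defined as the ratio between the gain of the optimal beam and that of the predicted beam.

```python
import numpy as np


def trajectory_rmse(pred_pos, true_pos):
    """Root-mean-square Euclidean position error, in the units of the inputs."""
    pred_pos = np.asarray(pred_pos, dtype=float)
    true_pos = np.asarray(true_pos, dtype=float)
    sq_dist = np.sum((pred_pos - true_pos) ** 2, axis=-1)
    return float(np.sqrt(np.mean(sq_dist)))


def gain_loss_db(h, w_pred, w_best):
    """Beamforming gain lost by using w_pred instead of w_best, in dB.

    h is the channel vector; both beams are assumed to have equal norm.
    np.vdot conjugates its first argument, so it computes h^H w.
    """
    gain_pred = np.abs(np.vdot(h, w_pred)) ** 2
    gain_best = np.abs(np.vdot(h, w_best)) ** 2
    return float(10.0 * np.log10(gain_best / gain_pred))
```

A loss of 0 dB means the predicted beam is as good as the optimal one, and roughly 3 dB means half the received power is lost.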

Circularity Check

0 steps flagged

No circularity identified; derivation chain not reducible to inputs

full rationale

The abstract and available description outline a multi-modal framework (visual/LiDAR/GPS + Sionna channel model) feeding a GPT-2 backbone for cascaded trajectory-then-beam prediction, with simulation results claimed as validation. No equations, parameter-fitting steps, self-citations, or uniqueness theorems are present in the provided text. No load-bearing claim reduces by construction to a fitted input or self-referential definition. The central claim rests on external simulation outcomes rather than internal redefinition, satisfying the criteria for a self-contained non-circular presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.1-grok · 5746 in / 1061 out tokens · 51835 ms · 2026-06-29T15:44:40.664336+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    Pioneering scalable prototype for mid-band XL-MIMO systems: Design and implementation,

    J. Tian et al., “Pioneering scalable prototype for mid-band XL-MIMO systems: Design and implementation,” IEEE J. Sel. Areas Commun., vol. 44, pp. 3365–3381, Jan. 2026

  2. [2]

    Study on artificial intelligence (AI) / machine learning (ML) for NR air interface,

    3GPP, “Study on artificial intelligence (AI) / machine learning (ML) for NR air interface,” 3rd Generation Partnership Project (3GPP), Technical Report (TR) 38.843, 2024

  3. [3]

    Toward extra large-scale MIMO: New channel properties and low-cost designs,

    Y. Han et al., “Toward extra large-scale MIMO: New channel properties and low-cost designs,” IEEE Internet Things J., vol. 10, no. 16, pp. 14569–14594, Aug. 2023

  4. [4]

    Channel estimation for extremely large-scale MIMO: Far-field or near-field?

    M. Cui and L. Dai, “Channel estimation for extremely large-scale MIMO: Far-field or near-field?” IEEE Trans. Commun., vol. 70, no. 4, pp. 2663–2677, Apr. 2022

  5. [5]

    Vision-position multi-modal beam prediction using real millimeter wave datasets,

    G. Charan et al., “Vision-position multi-modal beam prediction using real millimeter wave datasets,” in Proc. IEEE WCNC, Apr. 2022, pp. 2727–2731

  6. [6]

    Lidar aided future beam prediction in real-world millimeter wave V2I communications,

    S. Jiang, G. Charan, and A. Alkhateeb, “Lidar aided future beam prediction in real-world millimeter wave V2I communications,” IEEE Wireless Commun. Lett., vol. 12, no. 2, pp. 212–216, Feb. 2022

  7. [7]

    Multi-modal large models based beam prediction: An example empowered by DeepSeek,

    Y. Zhao et al., “Multi-modal large models based beam prediction: An example empowered by DeepSeek,” arXiv preprint arXiv:2506.05921, 2025

  8. [8]

    M2BeamLLM: Multimodal sensing-empowered mmwave beam prediction with large language models,

    C. Zheng et al., “M2BeamLLM: Multimodal sensing-empowered mmwave beam prediction with large language models,” arXiv preprint arXiv:2506.14532, 2025

  9. [9]

    WiFo: Wireless foundation model for channel prediction,

    B. Liu et al., “WiFo: Wireless foundation model for channel prediction,” Sci. China Inf. Sci., vol. 68, no. 6, p. 162302, May 2025

  10. [10]

    Multimodal-NF: A Wireless Dataset for Near-Field Low-Altitude Sensing and Communications

    M. Li et al., “Multimodal-NF: A wireless dataset for near-field low-altitude sensing and communications,” arXiv preprint arXiv:2603.28280, 2026

  11. [11]

    Language models are unsupervised multitask learners,

    A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

  12. [12]

    Deep residual learning for image recognition,

    K. He et al., “Deep residual learning for image recognition,” in Proc. IEEE CVPR, Jun. 2016, pp. 770–778

  13. [13]

    Pointnet: Deep learning on point sets for 3D classification and segmentation,

    C. R. Qi et al., “Pointnet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE CVPR, Jul. 2017, pp. 652–660

  14. [14]

    Sionna: An open-source library for next-generation physical layer research,

    S. Dörner, J. Hoydis, and S. ten Brink, “Sionna: An open-source library for next-generation physical layer research,” arXiv preprint arXiv:2203.11854, 2022

  15. [15]

    Recurrent neural network based beam prediction for millimeter-wave 5G systems,

    S. Khunteta and A. K. R. Chavva, “Recurrent neural network based beam prediction for millimeter-wave 5G systems,” in Proc. IEEE WCNC, Mar. 2021, pp. 1–6

  16. [16]

    Multi-cell multi-beam prediction using auto-encoder LSTM for mmwave systems,

    S. H. A. Shah and S. Rangan, “Multi-cell multi-beam prediction using auto-encoder LSTM for mmwave systems,” IEEE Trans. Wireless Commun., vol. 21, no. 12, pp. 10366–10380, Dec. 2022