Heterogeneous AAV Logistics Task Allocation: A Reinforcement Learning Enhanced Overlapping Coalition Formation Game Approach
Pith reviewed 2026-06-29 17:35 UTC · model grok-4.3
The pith
A transformer-based reinforcement learning policy enhances overlapping coalition formation for heterogeneous AAV logistics task allocation, reducing generalized costs and guaranteeing Nash-stable convergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that embedding a transformer-based soft actor-critic network into an overlapping coalition formation game lets heterogeneous AAVs dynamically form overlapping coalitions for stochastic, time-sensitive tasks. The coalition formation process is claimed to be an exact potential game that converges to a Nash-stable equilibrium within a finite number of iterations, and the method is reported to cut the generalized logistics cost by 39.76% relative to the heuristic OCF baseline in a scenario with 32 AAVs and 80 tasks.
What carries the argument
The transformer-based soft actor-critic network, which uses multi-head self-attention to encode variable-length logistics states and capture spatiotemporal dependencies to guide coalition updates in the overlapping coalition formation game.
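The encoding step can be illustrated with a minimal NumPy sketch of masked multi-head self-attention. It is not the paper's network: the function names, the zero-padding scheme, the mean pooling, and the weight matrices `wq`, `wk`, `wv` are all assumptions made for illustration. It only shows how a task list of varying length can be mapped to a fixed-size vector without the padding leaking into the result.

```python
import numpy as np

def masked_self_attention(x, mask, wq, wk, wv, num_heads):
    """Multi-head self-attention over padded rows; mask is True for real tasks."""
    n, d = x.shape
    assert d % num_heads == 0, "feature size must split evenly across heads"
    dh = d // num_heads

    def split(t):  # (n, d) -> (heads, n, dh)
        return t.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)        # (heads, n, n)
    scores = np.where(mask[None, None, :], scores, -1e9)   # hide padded keys
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v).transpose(1, 0, 2).reshape(n, d)  # merge heads

def encode_state(tasks, max_len, wq, wk, wv, num_heads=2):
    """Pad a (num_tasks, d) array to max_len rows and pool to one d-vector."""
    num_tasks, d = tasks.shape
    x = np.zeros((max_len, d))
    x[:num_tasks] = tasks
    mask = np.arange(max_len) < num_tasks       # True for real tasks only
    attended = masked_self_attention(x, mask, wq, wk, wv, num_heads)
    return attended[mask].mean(axis=0)          # average over real tasks only
```

Because padded keys receive zero attention weight and padded rows are excluded from the pooling, calling `encode_state` on the same tasks with `max_len=5` or `max_len=9` should give the same vector, which is the property a policy needs when the task set grows and shrinks over time.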
Load-bearing premise
The model assumes that global optimality can be captured by a single generalized logistics cost coupling service quality and resource consumption, and that the transformer policy produces reliable updates for time-varying task sets.
What would settle it
Observing that the coalition formation process fails to reach a Nash-stable equilibrium in repeated simulations, or that the cost reduction does not materialize in scenarios with higher task variability, would challenge the central claims.
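The first of these checks can be made concrete with a small Nash-stability test. The sketch below uses a hypothetical two-AAV, two-task toy game: the demand values, the penalty, the energy term, and the marginal-cost utility are assumptions, not the paper's model. A profile assigns each AAV the tuple of tasks it joins, so one AAV may sit in several coalitions and one task may be served by several AAVs.

```python
DEMAND = {0: 2, 1: 1}     # hypothetical capacity units each task needs
PENALTY = 5               # hypothetical cost per unmet unit (service quality)
STRATEGIES = [((), (0,), (1,), (0, 1))] * 2   # task subsets each of 2 AAVs may join

def cost(profile):
    """Toy generalized cost: unmet-demand penalty plus one energy unit per membership."""
    unmet = sum(
        max(0, need - sum(t in s for s in profile))
        for t, need in DEMAND.items()
    )
    return PENALTY * unmet + sum(len(s) for s in profile)

def utility(i, profile):
    """Negative marginal cost of AAV i: cost with i's choice minus cost with i idle."""
    idle = profile[:i] + ((),) + profile[i + 1:]
    return -(cost(profile) - cost(idle))

def is_nash_stable(profile):
    """True if no AAV can raise its own utility by unilaterally changing coalitions."""
    for i, options in enumerate(STRATEGIES):
        for s in options:
            trial = profile[:i] + (s,) + profile[i + 1:]
            if utility(i, trial) > utility(i, profile):
                return False
    return True
```

In this toy game `is_nash_stable(((0, 1), (0,)))` holds, while the all-idle profile `((), ())` is not stable because either AAV gains by joining both tasks. Running the same kind of check on the final allocation of every simulation run is one way to detect a failure to reach a Nash-stable equilibrium.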
Original abstract
In dynamic urban logistics, the stochastic emergence of time-sensitive tasks poses a significant optimality challenge for heterogeneous AAV logistics task allocation. To address this problem, a reinforcement learning enhanced overlapping coalition formation game approach is proposed. A dynamic task allocation model is established, where global optimality is mathematically quantified by a generalized logistics cost coupling service quality and resource consumption. To deal with the time-varying task sets induced by stochastic order arrivals, a transformer-based soft actor-critic network is designed. By leveraging multi-head self-attention to encode variable-length logistics states and capture task-wise spatiotemporal dependencies, the learned policy adaptively guides coalition updates, replacing heuristic rules in the overlapping coalition formation game. On this basis, heterogeneous AAVs can form more efficient overlapping coalitions for dynamic logistics tasks. The resulting coalition formation process is proven to constitute an exact potential game, which guarantees convergence to a Nash-stable equilibrium within a finite number of iterations. Numerical simulations demonstrate that the proposed algorithm effectively improves the optimality of task allocation under the generalized logistics cost criterion. In a scenario with 32 AAVs and 80 tasks, our algorithm achieves a 39.76% cost reduction compared with the heuristic OCF baseline. Indoor flight experiments further validate its practicality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a reinforcement learning enhanced overlapping coalition formation game approach for heterogeneous AAV logistics task allocation in dynamic urban settings. It establishes a dynamic task allocation model quantified by a generalized logistics cost, designs a transformer-based soft actor-critic network to encode variable-length states and guide coalition updates adaptively, proves that the resulting coalition formation constitutes an exact potential game guaranteeing finite convergence to a Nash-stable equilibrium, and reports a 39.76% cost reduction versus a heuristic OCF baseline in a 32-AAV/80-task simulation scenario along with indoor flight experiments.
Significance. If the exact potential game property is preserved under the learned RL policy and the performance gains are robust, the work could contribute a theoretically grounded hybrid method for stochastic task allocation that improves on heuristic baselines while providing convergence guarantees. The use of multi-head self-attention for spatiotemporal dependencies in logistics states and the experimental validation are positive elements.
major comments (2)
- [Abstract] The central claim that the coalition formation process is an exact potential game (guaranteeing finite Nash-stable convergence) is load-bearing, yet no derivation is supplied. Because the transformer-based SAC policy is trained directly on the generalized logistics cost and produces state-dependent updates, it is unclear whether the individual utilities remain aligned with any global potential function; the RL guidance may introduce non-local or non-myopic dependencies that invalidate the exact potential property even if the underlying game without RL satisfies it.
- [Numerical simulations] The reported 39.76% cost reduction for 32 AAVs and 80 tasks is presented without statistical significance tests, error bars, variance across runs, or an explicit definition of the heuristic OCF baseline and the precise components of the generalized logistics cost, preventing assessment of whether the improvement reflects genuine generalization or fitting to the training criterion.
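The property at stake in the first major comment can be tested by brute force on a small instance. The sketch below follows the Monderer-Shapley definition: a game is an exact potential game if every unilateral deviation changes the deviating player's utility by exactly the change in the potential. The toy cost, the two utility designs, and all names are assumptions for illustration and say nothing about which design the paper actually uses.

```python
from itertools import product

DEMAND = {0: 2, 1: 1}     # hypothetical capacity units each task needs
PENALTY = 5               # hypothetical cost per unmet unit
STRATEGIES = [((), (0,), (1,), (0, 1))] * 2   # task subsets each of 2 AAVs may join

def cost(profile):
    """Toy generalized cost: unmet-demand penalty plus one energy unit per membership."""
    unmet = sum(
        max(0, need - sum(t in s for s in profile))
        for t, need in DEMAND.items()
    )
    return PENALTY * unmet + sum(len(s) for s in profile)

def marginal_utility(i, profile):
    """Negative marginal cost of AAV i relative to staying idle."""
    idle = profile[:i] + ((),) + profile[i + 1:]
    return -(cost(profile) - cost(idle))

def equal_split_utility(i, profile):
    """Each AAV bears an equal share of the total cost."""
    return -cost(profile) / len(profile)

def is_exact_potential(utility, potential, strategies):
    """Check that utility change equals potential change for every unilateral deviation."""
    for profile in product(*strategies):
        for i, options in enumerate(strategies):
            for s in options:
                trial = profile[:i] + (s,) + profile[i + 1:]
                du = utility(i, trial) - utility(i, profile)
                dphi = potential(trial) - potential(profile)
                if abs(du - dphi) > 1e-9:
                    return False
    return True

def negative_cost(profile):
    """Candidate potential: the negated global cost."""
    return -cost(profile)
```

With `negative_cost` as the candidate potential, the marginal-cost utility passes the check and the equal-split utility fails it, since each deviation then moves the utility by only half the change in cost. The same enumeration, applied to the paper's actual utilities on a small instance, would show whether the claimed alignment holds.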
minor comments (1)
- [Abstract] The abstract references indoor flight experiments for practicality validation but supplies no quantitative results, setup parameters, or comparison metrics in the provided text.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to strengthen the presentation while preserving the core contributions.
- Referee: [Abstract] The central claim that the coalition formation process is an exact potential game (guaranteeing finite Nash-stable convergence) is load-bearing, yet no derivation is supplied. Because the transformer-based SAC policy is trained directly on the generalized logistics cost and produces state-dependent updates, it is unclear whether the individual utilities remain aligned with any global potential function; the RL guidance may introduce non-local or non-myopic dependencies that invalidate the exact potential property even if the underlying game without RL satisfies it.
Authors: The exact potential game property holds for the underlying overlapping coalition formation game, where individual utilities are explicitly constructed to align with the generalized logistics cost serving as the potential function. The transformer-based SAC policy is trained to optimize this same cost but functions only as an adaptive selector of which valid coalition updates to execute; it does not alter the utility definitions or introduce non-myopic dependencies into the game structure. Consequently, best-response dynamics remain aligned with the potential, preserving finite convergence to a Nash-stable equilibrium. We will add an explicit derivation of the potential function and a proof of the exact potential property (including the role of the learned policy) to the revised manuscript.
Revision: yes
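The selector argument can be illustrated with improvement dynamics that take a pluggable choice rule. In the sketch below the learned policy is replaced by two stand-ins, a greedy rule and a seeded random rule; the toy cost and all names are hypothetical. It assumes utilities are aligned with the cost, so that an improving move for an AAV is exactly a cost-reducing unilateral switch. The argument only goes through if the selector is restricted to such moves; a policy allowed to execute non-improving updates would fall outside this sketch.

```python
import random

DEMAND = {0: 2, 1: 1}     # hypothetical capacity units each task needs
PENALTY = 5               # hypothetical cost per unmet unit
STRATEGIES = [((), (0,), (1,), (0, 1))] * 2   # task subsets each of 2 AAVs may join

def cost(profile):
    """Toy generalized cost: unmet-demand penalty plus one energy unit per membership."""
    unmet = sum(
        max(0, need - sum(t in s for s in profile))
        for t, need in DEMAND.items()
    )
    return PENALTY * unmet + sum(len(s) for s in profile)

def improving_moves(profile):
    """All profiles reachable by one unilateral switch that lowers the cost."""
    moves = []
    for i, options in enumerate(STRATEGIES):
        for s in options:
            trial = profile[:i] + (s,) + profile[i + 1:]
            if cost(trial) < cost(profile):
                moves.append(trial)
    return moves

def run(select, profile=((), ()), max_iters=100):
    """Apply moves chosen by `select` until no improving switch remains."""
    history = [cost(profile)]
    for _ in range(max_iters):
        moves = improving_moves(profile)
        if not moves:
            return profile, history       # no AAV can improve: Nash-stable
        profile = select(moves)
        history.append(cost(profile))
    raise RuntimeError("did not settle within max_iters")

def greedy(moves):
    """Stand-in selector: take the move with the lowest resulting cost."""
    return min(moves, key=cost)

_rng = random.Random(0)

def sampled(moves):
    """Stand-in selector: pick any improving move at random."""
    return _rng.choice(moves)
```

Both selectors produce a strictly decreasing cost history and stop at a profile with no improving move, because the cost takes finitely many values. The selector changes which equilibrium is reached and how fast, not whether one is reached.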
- Referee: [Numerical simulations] The reported 39.76% cost reduction for 32 AAVs and 80 tasks is presented without statistical significance tests, error bars, variance across runs, or an explicit definition of the heuristic OCF baseline and the precise components of the generalized logistics cost, preventing assessment of whether the improvement reflects genuine generalization or fitting to the training criterion.
Authors: We agree that additional statistical detail and explicit definitions are required for rigorous evaluation. In the revised manuscript we will report results with error bars and standard deviation across independent runs, include statistical significance tests against the baseline, provide a precise definition of the heuristic OCF baseline, and fully specify the components of the generalized logistics cost.
Revision: yes
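A minimal sketch of the kind of reporting promised here, using only the Python standard library. It assumes the baseline and the proposed method are run on the same random seeds so that costs can be paired; the function name, the sign-flip permutation test, and the parameters are illustrative choices, not the authors' stated procedure.

```python
import random
from statistics import mean, stdev

def cost_reduction_report(baseline, proposed, n_perm=10000, seed=0):
    """Summarise paired per-run costs: percent reduction, spread, one-sided p-value."""
    assert len(baseline) == len(proposed) >= 2
    # Percent reduction per run, relative to that run's baseline cost.
    reductions = [100.0 * (b - p) / b for b, p in zip(baseline, proposed)]
    diffs = [b - p for b, p in zip(baseline, proposed)]
    observed = mean(diffs)
    rng = random.Random(seed)
    # Sign-flip permutation test: under the null hypothesis of no difference,
    # each paired difference is equally likely to be positive or negative.
    hits = 0
    for _ in range(n_perm):
        flipped = [d * rng.choice((-1, 1)) for d in diffs]
        if mean(flipped) >= observed:
            hits += 1
    return {
        "mean_reduction_pct": mean(reductions),
        "std_reduction_pct": stdev(reductions),
        "p_value": (hits + 1) / (n_perm + 1),
    }
```

For example, `cost_reduction_report([100, 100, 100, 100], [60, 62, 58, 60])`, with numbers invented purely to exercise the function, gives a mean reduction of 40% with a sample standard deviation of about 1.63 points. With only four paired runs the p-value cannot fall much below 1/16, which shows why a single headline figure such as 39.76% needs many independent runs behind it.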
Circularity Check
Derivation chain is self-contained with no circular reductions
The paper defines a generalized logistics cost as the global optimality criterion, designs a transformer-based SAC policy to guide coalition updates, and states that the resulting coalition formation process constitutes an exact potential game with finite convergence to a Nash-stable equilibrium. The proof is presented as following from the structure of the overlapping coalition formation game itself. No step reduces by construction to a fitted parameter renamed as prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. Simulations and experiments provide external numerical benchmarks against a heuristic OCF baseline using the same cost function, keeping the theoretical claim independent of the learned policy outputs.