arxiv: 2606.18841 · v1 · pith:CYGHEMLKnew · submitted 2026-06-17 · 💻 cs.CV

Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

Pith reviewed 2026-06-26 21:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords air-ground collaboration · progressive cross-task · AGPC benchmark · socialized co-perception · dual-layer router · negative transfer · heterogeneous views · visual perception

The pith

Task-conditioned collaboration outperforms uniform fusion for heterogeneous air-ground perception by reducing negative transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models air-ground perception as a progressive sequence of dependent tasks rather than a single uniform fusion step, because aerial and ground views differ in geometry, scale, and occlusion. It introduces the AGPC benchmark of over 745K aligned video frames to test this modeling choice. The Socialized Co-Perception framework uses a Dual-Layer Router to share features selectively across views and tasks while suppressing harmful interference. Experiments on the benchmark report a 3.73% coevolutionary gain and 7.86% higher average downstream performance than uniform-fusion baselines. These outcomes indicate that conditioning collaboration on task order and view differences produces measurable improvements over standard single-task fusion.

Core claim

The paper establishes that air-ground perception should be formulated as progressive cross-task collaboration, supported by the AGPC benchmark and implemented through the Socialized Co-Perception framework whose Dual-Layer Router decouples multi-scale expert selection from task-conditioned modulation, producing a 3.73% coevolutionary gain and 7.86% improvement in average downstream performance over uniform fusion.

What carries the argument

The Dual-Layer Router, which separates input-side multi-scale expert selection from output-side task-conditioned modulation to enable selective cross-view and cross-task interaction.
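
The separation described above can be pictured with a toy example. What follows is a minimal sketch, not the authors' implementation: it assumes the input-side router is a softmax gate over per-scale expert features and the output-side router is a per-channel sigmoid gate on a residual, loosely following the Figure 4 caption ("multi-scale feature aggregation" and "residual gating"). All function names and the plain-list feature representation are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def input_router(expert_feats, scale_logits):
    """Assumed input side: softly select among multi-scale expert features."""
    weights = softmax(scale_logits)
    dim = len(expert_feats[0])
    mixed = [sum(w * feat[i] for w, feat in zip(weights, expert_feats))
             for i in range(dim)]
    return mixed, weights

def output_router(own_feat, mixed, task_gate_logits):
    """Assumed output side: a task-conditioned gate scales the residual per channel."""
    return [o + sigmoid(g) * m
            for o, m, g in zip(own_feat, mixed, task_gate_logits)]

def dual_layer_route(own_feat, expert_feats, scale_logits, task_gate_logits):
    """Chain the two routers: select shared features, then gate them per task."""
    mixed, weights = input_router(expert_feats, scale_logits)
    return output_router(own_feat, mixed, task_gate_logits), weights

# Toy call: two "scales" of shared features, three channels (made-up values).
own = [1.0, 2.0, 3.0]
experts = [[0.5, 0.5, 0.5], [4.0, 4.0, 4.0]]
closed, _ = dual_layer_route(own, experts, [0.0, 0.0], [-50.0] * 3)
opened, w = dual_layer_route(own, experts, [0.0, 0.0], [50.0] * 3)
```

In this toy setup a strongly negative task gate leaves `own` essentially unchanged, which is one simple way a router could block features that would otherwise cause negative transfer. A strongly positive gate adds the full mixed feature.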

If this is right

  • Task-conditioned routing reduces negative transfer across heterogeneous views.
  • Aerial localization supplies useful priors for subsequent ground target association.
  • Identity-aware parsing improves when it follows the prior cross-task stages.
  • Average performance across localization, association, and parsing rises by 7.86%.
  • The AGPC benchmark supplies a standardized testbed for evaluating progressive methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same router structure could be tested on other multi-view settings such as vehicle-to-infrastructure perception.
  • Deployment in varying weather or lighting would reveal whether the selective routing remains stable.
  • Adding temporal consistency constraints across video frames might further increase the coevolutionary gain.
  • The progressive ordering could be learned rather than fixed if downstream tasks vary in priority.

Load-bearing premise

Differences in geometry, scale, and occlusion between aerial and ground views make uniform feature sharing prone to negative transfer.

What would settle it

A controlled test on the AGPC benchmark in which a uniform-fusion baseline matches or exceeds the Socialized Co-Perception framework, erasing the reported 3.73% coevolutionary gain and 7.86% downstream improvement, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.18841 by Boan Tao, Pengfei Zhu, Ruipu Zhao, Xinjie Yao, Yiming Sun, Yunqi Zhu, Zhen Wang, Zhihe Fan, Zhoupeng Guo.

Figure 1. Social co-perception in animal and machine societies.
Figure 2. Statistical analysis and qualitative examples of AGPC dataset. (a) The distribution of object categories and dataset types. (b) The instance counts for …
Figure 3. The proposed SCP framework. It establishes a progressive inference pipeline comprising three stages: (1) Global localization; (2) Cross-view search; …
Figure 4. Illustration of the DLR. It comprises the I-Router for adaptive multi-scale feature aggregation and the O-Router for dynamic residual gating, facilitating …
Original abstract

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the Air-Ground Progressive Collaboration (AGPC) benchmark comprising more than 745K spatio-temporally aligned video frames and proposes the Socialized Co-Perception (SCP) framework, whose core Dual-Layer Router (DLR) module decouples multi-scale expert selection from task-conditioned modulation. It claims that this progressive cross-task approach yields a 3.73% coevolutionary gain and 7.86% improvement in average downstream performance over uniform fusion, thereby mitigating negative transfer arising from geometric, scale, and occlusion discrepancies between aerial and ground views.

Significance. If the empirical claims hold under rigorous controls, the work would contribute a large-scale aligned benchmark and a task-conditioned collaboration mechanism to air-ground perception, a growing area in computer vision. The public code release at the cited GitHub repository and the benchmark construction itself constitute concrete strengths that support reproducibility and further research.

major comments (1)
  1. [Abstract and Experiments] Abstract and Experiments section: the central claim of a 3.73% coevolutionary gain and 7.86% average downstream improvement is presented without any description of the baselines, number of runs, statistical tests, error bars, train/validation/test splits, or controls for confounds. This information is load-bearing for assessing whether the DLR-driven task-conditioned interaction genuinely outperforms uniform fusion on the AGPC benchmark.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on experimental rigor. We agree that the reported gains require explicit supporting details to allow proper evaluation and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim of a 3.73% coevolutionary gain and 7.86% average downstream improvement is presented without any description of the baselines, number of runs, statistical tests, error bars, train/validation/test splits, or controls for confounds. This information is load-bearing for assessing whether the DLR-driven task-conditioned interaction genuinely outperforms uniform fusion on the AGPC benchmark.

    Authors: We acknowledge that the current manuscript does not provide sufficient detail on these experimental aspects. In the revised version we will expand the Experiments section to explicitly list all baselines and their configurations, report the number of independent runs together with statistical tests (e.g., paired t-tests) and error bars, document the precise train/validation/test splits on the AGPC benchmark, and include additional controls or ablations that address potential confounds arising from geometric, scale, and occlusion differences. We will also add a brief reference to these controls in the abstract. These additions will directly substantiate the claimed 3.73% coevolutionary gain and 7.86% downstream improvement over uniform fusion. revision: yes
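
The paired t-test mentioned in the rebuttal can be sketched in a few lines. This is a minimal illustration, assuming each method is evaluated on the same seeds so that scores pair up run by run. The function name is hypothetical, and the numbers in the usage line are placeholders, not results from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-run scores.

    t = mean(d) / (stdev(d) / sqrt(n)), where d_i = a_i - b_i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equal-length score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (spread / math.sqrt(n))
    return t, n - 1

# Placeholder scores for three hypothetical seeds (NOT from the paper).
t_stat, dof = paired_t_statistic([11.0, 22.0, 33.0], [10.0, 20.0, 30.0])
```

The resulting t value is compared against a Student's t distribution with n − 1 degrees of freedom to obtain a p-value. With only a handful of runs the test has little power, which is one reason error bars are worth reporting alongside it.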

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical study: it constructs the AGPC benchmark from 745K aligned frames and evaluates the SCP framework with its DLR module on downstream tasks, reporting measured gains (3.73% coevolutionary, 7.86% average) over uniform fusion baselines. No equations, parameter-fitting steps, or derivations appear in the abstract or described claims that reduce the reported improvements to quantities defined by the inputs themselves. The central claim rests on experimental comparison rather than any self-definitional, fitted-input, or self-citation chain that collapses by construction. The benchmark construction and code release supply independent grounding, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into model hyperparameters or background assumptions; the primary addition is the new routing module and the benchmark itself.

free parameters (1)
  • Hyperparameters of SCP and DLR
    Deep learning frameworks typically contain many tunable values; none are specified in the abstract.
axioms (1)
  • domain assumption: Functional dependencies exist among localization, target association, and fine-grained parsing that justify progressive rather than single-task modeling.
    Invoked in the abstract to motivate the cross-task formulation.
invented entities (1)
  • Dual-Layer Router (DLR) · no independent evidence
    purpose: Decouples input-side multi-scale expert selection from output-side task-conditioned modulation to enable selective interaction.
    New module introduced as the core of the SCP framework.

pith-pipeline@v0.9.1-grok · 5804 in / 1439 out tokens · 31039 ms · 2026-06-26T21:45:06.242144+00:00 · methodology
