Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

Boan Tao; Pengfei Zhu; Ruipu Zhao; Xinjie Yao; Yiming Sun; Yunqi Zhu; Zhen Wang; Zhihe Fan; Zhoupeng Guo

arxiv: 2606.18841 · v1 · pith:CYGHEMLKnew · submitted 2026-06-17 · 💻 cs.CV


Pith reviewed 2026-06-26 21:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords: air-ground collaboration · progressive cross-task · AGPC benchmark · socialized co-perception · dual-layer router · negative transfer · heterogeneous views · visual perception


The pith

Task-conditioned collaboration outperforms uniform fusion for heterogeneous air-ground perception by reducing negative transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models air-ground perception as a progressive sequence of dependent tasks rather than a single uniform fusion step, because aerial and ground views differ in geometry, scale, and occlusion. It introduces the AGPC benchmark of more than 745K aligned video frames to test this modeling choice. The Socialized Co-Perception (SCP) framework uses a Dual-Layer Router (DLR) to share features selectively across views and tasks while blocking harmful interference. Experiments on the benchmark report a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance over uniform-fusion baselines. These outcomes indicate that conditioning collaboration on task order and view differences produces measurable improvements over standard single-task fusion.

Core claim

The paper establishes that air-ground perception should be formulated as progressive cross-task collaboration, supported by the AGPC benchmark and implemented through the Socialized Co-Perception framework whose Dual-Layer Router decouples multi-scale expert selection from task-conditioned modulation, producing a 3.73% coevolutionary gain and 7.86% improvement in average downstream performance over uniform fusion.

What carries the argument

The Dual-Layer Router, which separates input-side multi-scale expert selection from output-side task-conditioned modulation to enable selective cross-view and cross-task interaction.
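To make the input-side versus output-side split concrete, here is a minimal sketch of a two-layer router in plain Python. It is not the authors' implementation: the function names (`input_router`, `output_router`), the softmax gate over scale experts, and the sigmoid residual gate are all assumptions chosen for illustration, loosely following the description of an input router for multi-scale aggregation and an output router for residual gating.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def input_router(expert_feats: Sequence[Vector],
                 gate_weights: Sequence[Vector],
                 query: Vector) -> Vector:
    """Input side (hypothetical): mix per-scale expert features.

    One gate vector per expert scores the query; a softmax turns the
    scores into mixing weights over the experts.
    """
    weights = softmax([dot(w, query) for w in gate_weights])
    dim = len(expert_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, expert_feats))
            for i in range(dim)]


def output_router(own: Vector,
                  shared: Vector,
                  task_embedding: Vector,
                  gate_weights: Sequence[Vector]) -> Vector:
    """Output side (hypothetical): task-conditioned residual gate.

    Each channel gets a gate in (0, 1) computed from the task embedding.
    A gate near 0 blocks the shared feature for that channel.
    """
    gates = [sigmoid(dot(w, task_embedding)) for w in gate_weights]
    return [o + g * s for o, g, s in zip(own, gates, shared)]
```

In this sketch the two decisions are independent: which scale to listen to is settled before the task embedding is consulted, and the task embedding alone decides how much of the shared signal reaches the task head. A strongly negative gate leaves the task's own features essentially untouched, which is one simple way to picture suppressing negative transfer.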

If this is right

Task-conditioned routing reduces negative transfer across heterogeneous views.
Aerial localization supplies useful priors for subsequent ground target association.
Identity-aware parsing improves when it follows the prior cross-task stages.
Average performance across localization, association, and parsing rises by 7.86%.
The AGPC benchmark supplies a standardized testbed for evaluating progressive methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same router structure could be tested on other multi-view settings such as vehicle-to-infrastructure perception.
Deployment in varying weather or lighting would reveal whether the selective routing remains stable.
Adding temporal consistency constraints across video frames might further increase the coevolutionary gain.
The progressive ordering could be learned rather than fixed if downstream tasks vary in priority.

Load-bearing premise

Differences in geometry, scale, and occlusion between aerial and ground views make uniform feature sharing prone to negative transfer.

What would settle it

A controlled test on the AGPC benchmark in which a uniform fusion baseline matches or exceeds the reported 3.73% coevolutionary gain and 7.86% downstream improvement would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.18841 by Boan Tao, Pengfei Zhu, Ruipu Zhao, Xinjie Yao, Yiming Sun, Yunqi Zhu, Zhen Wang, Zhihe Fan, Zhoupeng Guo.

**Figure 2.** Statistical analysis and qualitative examples of AGPC dataset. (a) The distribution of object categories and dataset types. (b) The instance counts for …

**Figure 3.** The proposed SCP framework. It establishes a progressive inference pipeline comprising three stages: (1) Global localization; (2) Cross-view search; …

**Figure 4.** Illustration of the DLR. It comprises the I-Router for adaptive multi-scale feature aggregation and the O-Router for dynamic residual gating, facilitating …

Original abstract

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New AGPC benchmark with 745K aligned frames plus SCP framework and Dual-Layer Router that decouples scale from task routing, reporting gains over uniform fusion.


The main things to know are the AGPC benchmark of more than 745K spatio-temporally aligned video frames and the SCP framework built around a Dual-Layer Router that separates multi-scale selection on the input side from task-conditioned modulation on the output side.

The benchmark is a clear addition because it supplies matched air and ground data at scale, which lets people test collaboration under the geometric, scale, and occlusion mismatches that the abstract flags. The progressive organization (aerial global localization, then ground target association, then identity-aware parsing), together with the router's explicit decoupling, gives a concrete mechanism for selective cross-view and cross-task interaction instead of blanket feature sharing. The reported 3.73% coevolutionary gain and 7.86% average downstream improvement are presented as evidence that task conditioning reduces negative transfer.

The router design and the size of the released dataset are the parts that stand out as useful. The code link also helps anyone who wants to inspect the implementation.

The soft spot is the experimental reporting. The abstract states the percentage gains but gives no information on the baselines, dataset splits, controls, or any error bars or significance checks. That leaves the strength of the central claim dependent on details that are not visible here. If the full paper supplies fair comparisons and shows the gains are not driven mainly by the new benchmark construction itself, the results will land more solidly.

This is aimed at the computer-vision community working on multi-view or collaborative perception, especially air-ground or robotics applications. Readers looking for new aligned data or routing ideas for fusion should find it useful.

I would send it for peer review. The benchmark and the router module are substantial enough to merit referee time, even if the experimental section needs closer scrutiny.

Referee Report

1 major / 0 minor

Summary. The paper introduces the Air-Ground Progressive Collaboration (AGPC) benchmark comprising more than 745K spatio-temporally aligned video frames and proposes the Socialized Co-Perception (SCP) framework, whose core Dual-Layer Router (DLR) module decouples multi-scale expert selection from task-conditioned modulation. It claims that this progressive cross-task approach yields a 3.73% coevolutionary gain and 7.86% improvement in average downstream performance over uniform fusion, thereby mitigating negative transfer arising from geometric, scale, and occlusion discrepancies between aerial and ground views.

Significance. If the empirical claims hold under rigorous controls, the work would contribute a large-scale aligned benchmark and a task-conditioned collaboration mechanism to air-ground perception, a growing area in computer vision. The public code release at the cited GitHub repository and the benchmark construction itself constitute concrete strengths that support reproducibility and further research.

major comments (1)

[Abstract and Experiments] The central claim of a 3.73% coevolutionary gain and 7.86% average downstream improvement is presented without any description of the baselines, number of runs, statistical tests, error bars, train/validation/test splits, or controls for confounds. This information is load-bearing for assessing whether the DLR-driven task-conditioned interaction genuinely outperforms uniform fusion on the AGPC benchmark.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on experimental rigor. We agree that the reported gains require explicit supporting details to allow proper evaluation and will revise the manuscript accordingly.


Referee: [Abstract and Experiments] The central claim of a 3.73% coevolutionary gain and 7.86% average downstream improvement is presented without any description of the baselines, number of runs, statistical tests, error bars, train/validation/test splits, or controls for confounds. This information is load-bearing for assessing whether the DLR-driven task-conditioned interaction genuinely outperforms uniform fusion on the AGPC benchmark.

Authors: We acknowledge that the current manuscript does not provide sufficient detail on these experimental aspects. In the revised version we will expand the Experiments section to explicitly list all baselines and their configurations, report the number of independent runs together with statistical tests (e.g., paired t-tests) and error bars, document the precise train/validation/test splits on the AGPC benchmark, and include additional controls or ablations that address potential confounds arising from geometric, scale, and occlusion differences. We will also add a brief reference to these controls in the abstract. These additions will directly substantiate the claimed 3.73% coevolutionary gain and 7.86% downstream improvement over uniform fusion.

Revision: yes
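For readers unfamiliar with the paired t-test mentioned in the response, the sketch below shows how the statistic could be computed from per-run scores of two methods evaluated on the same seeds or splits. It uses only the Python standard library; the function name `paired_t_statistic` and its interface are assumptions for illustration, and nothing here reflects the authors' actual evaluation protocol.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(method_a: Sequence[float],
                       method_b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired per-run scores.

    The i-th entries of both sequences must come from the same run
    condition (same seed or split), so that differences are meaningful.
    """
    if len(method_a) != len(method_b):
        raise ValueError("paired samples must have equal length")
    n = len(method_a)
    if n < 2:
        raise ValueError("need at least two paired runs")

    # Work on per-run differences; pairing removes run-to-run variance
    # that is shared by both methods.
    diffs = [a - b for a, b in zip(method_a, method_b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0.0:
        raise ValueError("all differences are identical; t is undefined")

    standard_error = spread / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1
```

The returned statistic would then be compared against a Student t distribution with the returned degrees of freedom to obtain a p-value. With only a handful of runs the test has little power, so reporting the mean difference and its standard error alongside it is usually more informative than the p-value alone.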

Circularity Check

0 steps flagged

No significant circularity


The paper presents an empirical study: it constructs the AGPC benchmark from more than 745K aligned frames and evaluates the SCP framework with its DLR module on downstream tasks, reporting measured gains (3.73% coevolutionary, 7.86% average) over uniform fusion baselines. No equations, parameter-fitting steps, or derivations appear in the abstract or described claims that reduce the reported improvements to quantities defined by the inputs themselves. The central claim rests on experimental comparison rather than any self-definitional, fitted-input, or self-citation chain that collapses by construction. The benchmark construction and code release supply independent grounding, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into model hyperparameters or background assumptions; the primary addition is the new routing module and the benchmark itself.

free parameters (1)

Hyperparameters of SCP and DLR
Deep learning frameworks typically contain many tunable values; none are specified in the abstract.

axioms (1)

Domain assumption: Functional dependencies exist among localization, target association, and fine-grained parsing that justify progressive rather than single-task modeling.
Invoked in the abstract to motivate the cross-task formulation.

invented entities (1)

Dual-Layer Router (DLR) · no independent evidence
purpose: Decouples input-side multi-scale expert selection from output-side task-conditioned modulation to enable selective interaction.
New module introduced as the core of the SCP framework.



[1] [1]

The social function of intellect,

N. K. Humphrey, “The social function of intellect,”Cambridge University Press, pp. 303–317, 1976

1976

[2] [2]

The social brain hypothesis,

R. I. Dunbar, “The social brain hypothesis,”Evol. Anthropol., vol. 6, no. 5, pp. 178–190, 1998

1998

[3] [3]

Multi-task learning for dense prediction tasks: A survey,

S. Vandenhende, S. Georgoulis, W. Van Gansbeke, M. Proesmans, D. Dai, and L. Van Gool, “Multi-task learning for dense prediction tasks: A survey,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3614–3633, 2022

2022

[4] [4]

Gradient surgery for multi-task learning,

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, 2020

2020

[5] [5]

Conflict-averse gradient descent for multi-task learning,

B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu, “Conflict-averse gradient descent for multi-task learning,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, 2021

2021

[6] [6]

Spatial-aware feature aggregation for image based cross-view geo-localization,

Y . Shi, L. Liu, X. Yu, and H. Li, “Spatial-aware feature aggregation for image based cross-view geo-localization,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019. 12

2019

[7] [7]

Vigor: Cross-view image geo-localization beyond one-to-one retrieval,

S. Zhu, T. Yang, and C. Chen, “Vigor: Cross-view image geo-localization beyond one-to-one retrieval,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 3640–3649

2021

[8] [8]

Transgeo: Transformer is all you need for cross-view image geo-localization,

S. Zhu, M. Shah, and C. Chen, “Transgeo: Transformer is all you need for cross-view image geo-localization,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 1162–1171

2022

[9] H. Ye and D. Xu, “Taskprompter: Spatial-channel multi-task prompting for dense scene understanding,” in Int. Conf. Learn. Represent. (ICLR), 2023.

[10] D. Stahler, B. Heinrich, and D. Smith, “Common ravens, corvus corax, preferentially associate with grey wolves, canis lupus, as a foraging strategy in winter,” Anim. Behav., vol. 64, no. 2, pp. 283–290, 2002.

[11] X. Yao, Y. Wang, P. Zhu, W. Lin, R. Zhao, Z. Guo, W. Li, and Q. Hu, “Socialized coevolution: Advancing a better world through cross-task collaboration,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 267, 2025, pp. 71 780–71 797.

[12] L. Yu, Y. Yang, X. Su, S. Sun, T. Jiang, and J. Huang, “Cooperative task assignment for aerial-ground detection systems via a novel hybrid genetic method,” IEEE Trans. Ind. Electron., vol. 72, no. 4, pp. 4063–4072, 2025.

[13] J. Chen, G. Zhang, H. Jiang, and Y. He, “A monocular vision-based localization system of size-uncertain ground targets for uavs,” IEEE Trans. Instrum. Meas., vol. 74, pp. 1–10, 2025.

[14] H. Nguyen, K. Nguyen, S. Sridharan, and C. Fookes, “Ag-reid. v2: Bridging aerial and ground views for person re-identification,” IEEE Trans. Inf. Forensics Security, vol. 19, pp. 2896–2908, 2024.

[15] H. Nguyen, K. Nguyen, A. Pemasiri, F. Liu, S. Sridharan, and C. Fookes, “Ag-vpreid: A challenging large-scale benchmark for aerial-ground video-based person re-identification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 1241–1251.

[16] K. A. Hambarde, N. Mbongo, M. Pavan Kumar, S. Mekewad, C. Fernandes, G. Silahtaroglu, A. Nithya, P. Wasnik, M. Rashidunnabi, P. Samale, and H. Proenca, “Detreidx: A stress-test dataset for real-world uav-based person recognition,” IEEE Trans. Biometrics, Behav., Identity Sci., vol. 8, no. 3, pp. 365–377, 2026.

[17] K. Yan, W. Qian, J. Cao, and C. Bi, “Agvot: Visual object tracking via cooperation of aerial and ground views,” IEEE Trans. Intell. Transp. Syst., vol. 27, no. 1, pp. 1416–1425, 2026.

[18] Y. Cui, H. Lu, X. Dong, J. Xiang, D. Li, and Z. Tu, “Air–ground cooperative multitarget hierarchical tracking method based on aerial fisheye view,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 55, no. 11, pp. 7651–7662, 2025.

[19] S. Chen and W. Dong, “A2visr: An active and adaptive ground–aerial localization system using visual inertial and single-range fusion,” IEEE Trans. Ind. Electron., vol. 73, no. 5, pp. 7340–7349, 2026.

[20] S. Hu, Z. Fang, Y. Deng, X. Chen, and Y. Fang, “Collaborative perception for connected and autonomous driving: Challenges, possible solutions and opportunities,” IEEE Wireless Commun., vol. 32, no. 5, pp. 228–234, 2025.

[21] B. Gao, J. Liu, H. Zou, J. Chen, L. He, and K. Li, “Vehicle-road-cloud collaborative perception framework and key technologies: A review,” IEEE Trans. Intell. Transp. Syst., vol. 25, no. 12, pp. 19 295–19 318, 2024.

[22] Y. Hou, B. Zou, M. Zhang, S. Yang, Y. Zhang, J. Zhuo, S. Chen, J. Chen, and H. Ma, “Agc-drive: A large-scale dataset for real-world aerial-ground collaboration in driving scenarios,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 38, 2025.

[23] Y. Zhang, B. Chen, J. Qin, F. Hu, and J. Hao, “Coopercept: Cooperative perception for 3d object detection of autonomous vehicles,” Drones, vol. 8, no. 6, p. 228, 2024.

[24] R. Hao, H. Yu, J. Zhong, C. Wang, J. Wang, Y. Kan, W. Yang, S. Fan, H. Yin, J. Qiu, Y. Mu, J. Sun, L. Chen, W. Zimmer, D. Zhang, S. Zhang, M. Schwager, P. Luo, and Z. Nie, “Research challenges and progress in the end-to-end v2x cooperative autonomous driving competition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2025, pp. 1828–1839.

[25] X. Li, J. Yin, W. Li, C. Xu, R. Yang, and J. Shen, “Di-v2x: Learning domain-invariant representation for vehicle-infrastructure collaborative 3d object detection,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 38, no. 4, 2024, pp. 3208–3215.

[26] Y. Xu, X. Li, H. Yuan, Y. Yang, and L. Zhang, “Multi-task learning with multi-query transformer for dense prediction,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 2, pp. 1228–1240, 2024.

[27] N. Gao, Y. Shan, X. Zhao, and K. Huang, “Learning category- and instance-aware pixel embedding for fast panoptic segmentation,” IEEE Trans. Image Process., vol. 30, pp. 6013–6023, 2021.

[28] H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, “Mask ssd: An effective single-stage approach to object instance segmentation,” IEEE Trans. Image Process., vol. 29, pp. 2078–2093, 2020.

[29] X. Wang, Z. Zhuang, F. Ye, and Y. Zhang, “Mtsam: Multi-task fine-tuning for segment anything model,” in Int. Conf. Learn. Represent. (ICLR), Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 95 268–95 289.

[30] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai, “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 24 185–24 198.

[31] Y. Lu, S. Huang, Y. Yang, S. Sirejiding, Y. Ding, and H. Lu, “Fedhca2: Towards hetero-client federated multi-task learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 5599–5609.

[32] H. Sun, Y. Song, J. Liu, J. Hu, Y.-W. Chen, and L. Lin, “One framework to rule them all: Unifying multimodal tasks with llm neural-tuning,” Pattern Recognit., vol. 171, p. 112275, 2026.

[33] M. Long, Z. Cao, J. Wang, and P. S. Yu, “Learning multiple tasks with multilinear relationship networks,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 30, 2017.

[34] B. Yang, F. Li, S. Zhao, W. Wang, J. Luo, H. Pu, M. Zhou, and Y. Pi, “Mtmlnet: Multi-task mutual learning network for infrared small target detection and segmentation,” IEEE Trans. Image Process., vol. 34, pp. 4414–4425, 2025.

[35] D. Kim, S. Woo, J.-Y. Lee, and I. S. Kweon, “Dense pixel-level interpretation of dynamic scenes with video panoptic segmentation,” IEEE Trans. Image Process., vol. 31, pp. 5383–5395, 2022.

[36] L. Wang, H. Liu, S. Zhou, W. Tang, and G. Hua, “Instance motion tendency learning for video panoptic segmentation,” IEEE Trans. Image Process., vol. 32, pp. 764–778, 2023.

[37] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 2636–2645.

[38] H. Yu, Y. Luo, M. Shu, Y. Huo, Z. Yang, Y. Shi, Z. Guo, H. Li, X. Hu, J. Yuan et al., “Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 21 361–21 370.

[39] A. Zheng, C. Zhang, C. Li, J. Tang, and C. Tan, “Multi-query vehicle re-identification: Viewpoint-conditioned network, unified dataset and new metric,” IEEE Trans. Image Process., vol. 32, pp. 5948–5960, 2023.

[40] H. Yu, W. Yang, H. Ruan, Z. Yang, Y. Tang, X. Gao, X. Hao, Y. Shi, Y. Pan, N. Sun et al., “V2x-seq: A large-scale sequential dataset for vehicle-infrastructure cooperative perception and forecasting,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 5486–5495.

[41] J. Wang, X. Cao, J. Zhong, Y. Zhang, Z. Han, H. Yu, C. Zhang, L. He, S. Xu, and J. Wang, “Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 40, no. 12, 2026, pp. 9867–9875.

[42] K. Fukunaga and D. M. Hummels, “Bayes error estimation using parzen and k-nn procedures,” IEEE Trans. Pattern Anal. Mach. Intell., no. 5, pp. 634–643, 1987.

[43] X. Yao, Y. Wang, P. Zhu, W. Lin, J. Li, W. Li, and Q. Hu, “Socialized learning: Making each other better through multi-agent collaboration,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 235, 2024, pp. 56 927–56 945.

[44] L. Lin, J. Zhang, and J. Liu, “Mutual information driven equivariant contrastive learning for 3d action representation learning,” IEEE Trans. Image Process., vol. 33, pp. 1883–1897, 2024.

[45] C. Cui, Y. Ren, J. Pu, J. Li, X. Pu, T. Wu, Y. Shi, and L. He, “A novel approach for effective multi-view clustering with information-theoretic perspective,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 44 847–44 859.

[46] J. Shao, S. Chen, Y. Li, K. Wang, Z. Yin, Y. He, J. Teng, Q. Sun, M. Gao, J. Liu et al., “Intern: A new learning paradigm towards general vision,” arXiv preprint arXiv:2111.08687, 2021.

[47] S. Baek, S. Lee, H. Jo, H. Choi, and D. Min, “Tadformer: Task-adaptive dynamic transformer for efficient multi-task learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 14 858–14 868.

[48] L. Wei, Y. Ma, Z. Lin, F. Wang, C. Jin, H. Zhao, and D. Chen, “Few-shot incremental multi-modal learning via touch guidance and imaginary vision synthesis,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2025, pp. 2045–2053.

[49] K. Huang, Y. Zhang, Y. Zhou, T. Xu, and T. Zhou, “Bidirectional channel-selective semantic interaction for semi-supervised medical segmentation,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 40, no. 7, 2026, pp. 5040–5048.

[50] F. Zhang, Z. Gu, and H. Wang, “Decoding with structured awareness: integrating directional, frequency-spatial, and structural attention for medical image segmentation,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 40, no. 15, 2026, pp. 12 421–12 429.

[51] F. Gu, Y. Li, X. Long, K. Ji, C. Chen, Q. Gu, and Z. Ni, “Mambaseg: Harnessing mamba for accurate and efficient image-event semantic segmentation,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 40, no. 6, 2026, pp. 4302–4310.

[52] Q. Cai, G. Yan, F. Zhang, C. Zhang, Z. Liu et al., “Semc: Structure-enhanced mixture-of-experts contrastive learning for ultrasound standard plane recognition,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 40, no. 4, 2026, pp. 2543–2551.