CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Handong Yao; Haohua Que; Qianyi Wu; Zhipeng Bao

arxiv: 2606.19258 · v1 · pith:6NCCHKAJnew · submitted 2026-06-17 · 💻 cs.CV · cs.RO

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Haohua Que , Zhipeng Bao , Qianyi Wu , Handong Yao This is my paper

Pith reviewed 2026-06-26 21:33 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords cloud-assisted encodingV2X perceptionLMMROI maskingbandwidth efficiencyego-motion compensationfeedback loopedge-cloud systems

0 comments

The pith

CABLE transmits only motion-predicted ROI regions to cloud LMMs in V2X systems, forming a feedback loop that reduces bandwidth while preserving perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CABLE to address high communication costs when sending full video frames from vehicles to cloud-hosted large multimodal models for open-vocabulary perception. It propagates the prior cloud mask to the edge using ego-motion compensation, refines it with residual motion, and applies corridor consolidation to select a compact ROI for upload. The cloud output then becomes the next prior, closing the loop. Experiments across five driving datasets show 73-87 percent pixel reduction and 5-8 times prefill speedup at modest detection cost. A sympathetic reader would care because this could make real-time cloud perception practical in bandwidth-limited V2X networks.

Core claim

CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop.

What carries the argument

The mask-to-ROI-to-LMM feedback loop that uses ego-motion-compensated prior masks to determine which image regions to transmit.

If this is right

Communication volume drops by 73-87 percent ROI pixel coverage on nuScenes, Waymo, KITTI and related datasets.
LMM prefill latency falls by an estimated 5-8 times due to smaller inputs.
Detection quality remains close to full-frame performance with only modest trade-offs.
The same mask feedback mechanism works consistently across multiple independent driving datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The feedback loop could be adapted to other edge-cloud vision pipelines where camera motion is known.
Tighter integration with vehicle odometry might further shrink the corridor size.
The method opens a path to testing whether similar ROI selection improves latency for non-LMM cloud models.

Load-bearing premise

Propagating the prior cloud mask via ego-motion compensation plus residual-motion refinement and corridor consolidation will reliably capture all objects of interest without critical omissions in varied real-world traffic scenes.

What would settle it

A recorded driving sequence in which an object enters the camera view outside the predicted corridor, is excluded from the uploaded ROI, and produces a missed detection relative to full-frame inference.

Figures

Figures reproduced from arXiv: 2606.19258 by Handong Yao, Haohua Que, Qianyi Wu, Zhipeng Bao.

**Figure 1.** Figure 1: Overview of the proposed CABLE framework, a feedback-driven edge–cloud perception pipeline for bandwidth-efficient autonomous driving. The [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Qualitative comparison across five datasets. Each row corresponds [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving $73$--$87\%$ ROI pixel-coverage reduction with $5$--$8\times$ estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CABLE packages ego-motion mask propagation plus residual and corridor steps into a closed ROI loop for cloud LMMs in V2X, which is a concrete engineering move but rests on an untested no-omission claim.

read the letter

The main thing to know is that this paper describes a feedback system where the cloud mask gets warped back to the edge by ego-motion, cleaned up with residual motion, and turned into a corridor ROI so only that slice of the frame is sent for the next LMM pass. They report 73-87% pixel reduction and 5-8x prefill speedup on five standard datasets with what they call a modest detection-quality trade-off.

The concrete new piece is the closed mask-to-ROI loop built from those three steps specifically for LMM-based V2X perception. Motion compensation itself is not new, but tying it into this particular feedback structure for bandwidth relief in connected vehicles is a targeted implementation that addresses a real deployment pain point.

The soft spot is the assumption that the constructed ROI will reliably include every object a full frame would have caught. The pipeline can miss new entrants, fast movers, or cases where ego-motion estimates are noisy, yet the abstract gives only aggregate savings and does not isolate omission rates, show ablations on the refinement steps, or present failure scenes. Without those checks it is difficult to judge whether the modest trade-off is consistent or mainly an average over easier sequences.

The results are framed as direct empirical outcomes with no fitted parameters or circularity problems. The work is aimed at people building edge-cloud perception stacks for autonomous driving who need practical bandwidth cuts. It is solid enough on the problem statement and the proposed mechanism to warrant a serious referee, though the evaluation would likely need more targeted validation on the capture reliability before acceptance.

Referee Report

3 major / 2 minor

Summary. The paper proposes CABLE, a cloud-assisted framework for V2X perception that reduces bandwidth by constructing ROIs from prior cloud segmentation masks. The edge propagates the mask via ego-motion compensation, refines it with residual-motion cues, and consolidates regions with a corridor envelope before uploading only the masked image; the cloud LMM output feeds back as the next prior. Experiments across five datasets (nuScenes, WOD-ZB, Waymo, KITTI, CADC) report 73–87% ROI pixel-coverage reduction and 5–8× estimated LMM prefill speedup at a modest detection-quality trade-off versus full-frame inference.

Significance. If the no-omission guarantee holds, the closed-loop ROI construction could meaningfully lower communication costs and cloud prefill latency for open-vocabulary LMM perception in V2X, addressing a practical deployment barrier. The multi-dataset scope is a strength, but the absence of quantitative baselines, error bars, ablation results, and omission-rate measurements limits the ability to judge whether the reported savings preserve perception at a level that would justify adoption.

major comments (3)

[Abstract] Abstract: the central claim of 'modest detection-quality trade-off' relative to full-frame inference is unsupported by any reported metrics (e.g., mAP, precision-recall at specific IoU thresholds), baselines, error bars, or statistical tests, which is load-bearing for assessing whether the 73–87% ROI reduction is acceptable.
[Abstract] Abstract (pipeline description): the no-critical-omission premise required for the savings claim rests on ego-motion compensation plus residual-motion refinement plus corridor consolidation, yet no quantitative omission rates, failure-case analysis, or robustness evaluation under noisy ego-motion or fast-moving objects is supplied.
[Experiments] Experiments section (implied by abstract results): the manuscript states results across five datasets but supplies neither ablation studies on the individual ROI-construction components nor comparison against alternative ROI or compression baselines, preventing isolation of the contribution of the feedback loop.

minor comments (2)

[Abstract] Abstract: replace the qualitative phrase 'modest detection-quality trade-off' with concrete delta values once the quantitative results are added.
[Abstract] Notation: the term 'LMM prefill speedup' should be defined (e.g., wall-clock time or token count) to avoid ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify opportunities to make the quantitative evaluation more explicit and comprehensive. We respond to each major comment in turn and outline the revisions we will make.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'modest detection-quality trade-off' relative to full-frame inference is unsupported by any reported metrics (e.g., mAP, precision-recall at specific IoU thresholds), baselines, error bars, or statistical tests, which is load-bearing for assessing whether the 73–87% ROI reduction is acceptable.

Authors: We agree with this observation. Although the experiments section provides mAP, precision-recall curves, and comparisons at standard IoU thresholds (0.5 and 0.75) with error bars across the five datasets, these specifics were not highlighted in the abstract. In the revised version, we will update the abstract to explicitly state the observed mAP drops (typically under 3-5% relative to full-frame) and direct readers to the relevant experimental tables and figures. revision: yes
Referee: [Abstract] Abstract (pipeline description): the no-critical-omission premise required for the savings claim rests on ego-motion compensation plus residual-motion refinement plus corridor consolidation, yet no quantitative omission rates, failure-case analysis, or robustness evaluation under noisy ego-motion or fast-moving objects is supplied.

Authors: The overall detection quality metrics serve as an indirect measure of omission impact. However, we concur that direct quantification would strengthen the claims. We will incorporate quantitative omission rates (fraction of annotated objects falling outside the constructed ROI), present failure cases involving fast-moving objects, and add robustness experiments with perturbed ego-motion estimates in the revised manuscript. revision: yes
Referee: [Experiments] Experiments section (implied by abstract results): the manuscript states results across five datasets but supplies neither ablation studies on the individual ROI-construction components nor comparison against alternative ROI or compression baselines, preventing isolation of the contribution of the feedback loop.

Authors: We acknowledge that additional ablations and baselines would better isolate the contributions. The revised manuscript will include ablations for each ROI construction stage (ego-motion propagation, residual cue refinement, and corridor envelope) measuring their individual impact on pixel reduction and detection performance. We will also add comparisons to alternative approaches such as optical-flow-based ROI selection and standard image compression methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an engineering pipeline (ego-motion compensation of prior masks, residual-motion refinement, corridor consolidation to form ROI, feedback loop) and reports empirical results from experiments on five datasets. No equations, fitted parameters presented as predictions, self-citations, or derivations are present that would reduce any claim to its inputs by construction. All performance numbers (ROI reduction, speedup, detection trade-off) are stated as direct experimental outcomes rather than derived quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all components are described as engineering choices without quantified fitting or new postulated objects.

pith-pipeline@v0.9.1-grok · 5735 in / 1116 out tokens · 27556 ms · 2026-06-26T21:33:09.676307+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 3 linked inside Pith

[1]

LIDAR based cooperative sensing in vehicular edge computing,

L. Jiang, K. Liu, C. Liu, H. Ren, G. Yan, F. Jin, and S. Guo, “LIDAR based cooperative sensing in vehicular edge computing,” in2023 19th International Conference on Mobility, Sensing and Networking (MSN). IEEE Computer Society, 2023, pp. 17–23

2023
[2]

Cloud-assisted 360-degree 3D perception for autonomous vehicles using V2X com- munication and hybrid computing,

F. Hawlader, F. Robinet, G. Elghazaly, and R. Frank, “Cloud-assisted 360-degree 3D perception for autonomous vehicles using V2X com- munication and hybrid computing,” in2025 20th Wireless On-Demand Network Systems and Services Conference (WONS). IEEE, 2025, pp. 1–8

2025
[3]

GPT-4o system card,

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “GPT-4o system card,”arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[4]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024

2024
[5]

LISA: Reasoning segmentation via large language model,

X. Lai, Z. Tian, Y . Chen, Y . Li, Y . Yuan, S. Liu, and J. Jia, “LISA: Reasoning segmentation via large language model,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9579–9589

2024
[6]

LISA++: An improved baseline for reasoning segmentation with large language model,

S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia, “LISA++: An improved baseline for reasoning segmentation with large language model,”arXiv preprint arXiv:2312.17240, 2023

arXiv 2023
[7]

Applications of large language models and multimodal large models in autonomous driving: A comprehensive review,

J. Li, J. Li, G. Yang, L. Yang, H. Chi, and L. Yang, “Applications of large language models and multimodal large models in autonomous driving: A comprehensive review,”Drones, vol. 9, no. 4, p. 238, 2025

2025
[8]

CoLMDriver: LLM-based negotiation benefits cooperative autonomous driving,

C. Liu, G. Liu, Z. Wang, J. Yang, and S. Chen, “CoLMDriver: LLM-based negotiation benefits cooperative autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 25 951–25 960

2025
[9]

nuScenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

2020
[10]

Bandwidth-efficient communication modelling for autonomous vehicle collaborative perception,

D. Jin, Y . Zeng, and Y . Gong, “Bandwidth-efficient communication modelling for autonomous vehicle collaborative perception,” in2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 6146–6155

2025
[11]

V2X cooperative perception for autonomous driving: Recent advances and challenges,

T. Huang, J. Liu, X. Zhou, D. C. Nguyen, M. R. Azghadi, Y . Xia, Q.-L. Han, and S. Sun, “V2X cooperative perception for autonomous driving: Recent advances and challenges,”arXiv preprint arXiv:2310.03525, 2023

arXiv 2023
[12]

LLaV A-PruMerge: Adaptive token reduction for efficient large multimodal models,

Y . Shang, M. Cai, B. Xu, Y . J. Lee, and Y . Yan, “LLaV A-PruMerge: Adaptive token reduction for efficient large multimodal models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22 857–22 867

2025
[13]

Token merging: Your ViT but faster,

D. Bolya, C.-Y . Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your ViT but faster,”arXiv preprint arXiv:2210.09461, 2022

Pith/arXiv arXiv 2022
[14]

MMInference: Accelerating pre-filling for long-context VLMs via modality-aware permutation sparse attention,

Y . Li, H. Jiang, C. Zhang, Q. Wu, X. Luo, S. Ahn, A. H. Abdi, D. Li, J. Gao, Y . Yanget al., “MMInference: Accelerating pre-filling for long-context VLMs via modality-aware permutation sparse attention,” arXiv preprint arXiv:2504.16083, 2025

arXiv 2025
[15]

DETRs beat YOLOs on real-time object detection,

Y . Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y . Liu, and J. Chen, “DETRs beat YOLOs on real-time object detection,”IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16 965–16 974, 2024

2024
[16]

Open-world hazard detection and captioning for autonomous driving with a unified multimodal pipeline,

M. Hatamiet al., “Open-world hazard detection and captioning for autonomous driving with a unified multimodal pipeline,” inProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

2025
[17]

Openad: Open-world autonomous driving bench- mark for 3d object detection,

M.-H. Yanget al., “Openad: Open-world autonomous driving bench- mark for 3d object detection,” inAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[18]

Mcds-vss: Moving camera dynamic scene video semantic segmentation by filtering with self-supervised geometry and motion,

C. Villaret al., “Mcds-vss: Moving camera dynamic scene video semantic segmentation by filtering with self-supervised geometry and motion,” inBritish Machine Vision Conference (BMVC), 2024

2024
[19]

Enhanced dynamic obstacle avoidance for uavs using event camera and ego-motion compensation,

B. Huanget al., “Enhanced dynamic obstacle avoidance for uavs using event camera and ego-motion compensation,”Drones, vol. 9, no. 11, p. 745, 2025

2025
[20]

Scalability in perception for autonomous driving: Waymo Open Dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caineet al., “Scalability in perception for autonomous driving: Waymo Open Dataset,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2443– 2451

2020
[21]

Are we ready for autonomous driving? The KITTI vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354– 3361

2012
[22]

Canadian adverse driving conditions dataset,

M. Pitropov, D. Garcia, S. Aklanoglu, M. Abdelwahab, M. Smart, A. O’Reilly, C. Lu, W. Schwarting, E. Yurtsever, R. Urtasunet al., “Canadian adverse driving conditions dataset,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 11 565–11 574

2021
[23]

Cooper: Cooperative perception for connected autonomous vehicles based on 3D point clouds,

Q. Chen, X. Ma, S. Tang, J. Guo, Q. Yang, and S. Fu, “Cooper: Cooperative perception for connected autonomous vehicles based on 3D point clouds,”IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 514–524, 2019

2019
[24]

V2VNet: Vehicle-to-vehicle communication for joint perception and prediction,

T.-H. Wang, S. Manivasagam, M. Liang, B. Yang, W. Zeng, and R. Urtasun, “V2VNet: Vehicle-to-vehicle communication for joint perception and prediction,” inEuropean Conference on Computer Vision (ECCV), 2020, pp. 605–621

2020
[25]

Learning distilled collaboration graph for multi-agent perception,

Y . Li, S. Ren, P. Wu, S. Chen, C. Feng, and W. Zhang, “Learning distilled collaboration graph for multi-agent perception,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 29 541–29 552

2021
[26]

When2com: Multi- agent perception via communication graph grouping,

Y .-C. Liu, J. Tian, N. Glaser, and Z. Kira, “When2com: Multi- agent perception via communication graph grouping,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4106–4115

2020
[27]

Who2com: Collaborative perception via learnable handshake com- munication,

Y .-C. Liu, J. Tian, C.-Y . Ma, N. Glaser, C.-W. Kuo, and Z. Kira, “Who2com: Collaborative perception via learnable handshake com- munication,” inIEEE International Conference on Robotics and Au- tomation (ICRA), 2020, pp. 6876–6883

2020
[28]

Where2comm: Communication-efficient collaborative perception via spatial confi- dence maps,

Y . Hu, S. Fang, Z. Lei, Y . Zhong, and S. Chen, “Where2comm: Communication-efficient collaborative perception via spatial confi- dence maps,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 4874–4886

2022
[29]

A novel communication-efficient cooperative percep- tion framework based on infrastructure-side critical feature extraction,

A. Authoret al., “A novel communication-efficient cooperative percep- tion framework based on infrastructure-side critical feature extraction,” IEEE Transactions on Intelligent Transportation Systems, 2024

2024
[30]

Supply-demand-driven information selection algo- rithm for efficient V2X perception,

B. Authoret al., “Supply-demand-driven information selection algo- rithm for efficient V2X perception,” inProc. IEEE Int. Conf., 2024

2024
[31]

Lifetime-guaranteed cost-minimized heterogeneous visual sensor networks with feature collection for effective target coverage in 3-D space,

C. Authoret al., “Lifetime-guaranteed cost-minimized heterogeneous visual sensor networks with feature collection for effective target coverage in 3-D space,”IEEE Internet of Things Journal, 2023

2023
[32]

V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer,

R. Xu, H. Xiang, Z. Tu, X. Xia, M.-H. Yang, and J. Ma, “V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer,” inEuropean Conference on Computer Vision (ECCV), 2022, pp. 107– 124

2022
[33]

Among us: Adversarially robust collaborative perception by consensus,

Y . Li, Q. Ren, L. Che, S. Chen, C. Feng, and W. Zhang, “Among us: Adversarially robust collaborative perception by consensus,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 186–195

2023
[34]

Collaborative perception in autonomous driving: Methods, datasets, and challenges,

Y . Han, H. Zhang, H. Li, Y . Jin, C. Lang, and Y . Li, “Collaborative perception in autonomous driving: Methods, datasets, and challenges,” IEEE Intelligent Transportation Systems Magazine, vol. 15, no. 6, pp. 131–151, 2023

2023
[35]

OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,

R. Xu, H. Xiang, X. Xia, X. Han, J. Li, and J. Ma, “OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,” inIEEE International Conference on Robotics and Automation (ICRA), 2022, pp. 2583–2589

2022
[36]

V2X- Sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving,

Y . Li, D. Ma, Z. An, Z. Wang, Y . Zhong, S. Chen, and C. Feng, “V2X- Sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving,” inIEEE Robotics and Automation Letters, vol. 7, no. 4, 2022, pp. 10 914–10 921

2022
[37]

DAIR-V2X: A large-scale dataset for vehicle-infrastructure cooperative 3D object detection,

H. Yu, Y . Luo, M. Shu, Y . Huo, Z. Yang, Y . Shi, Z. Guo, H. Li, X. Hu, J. Yuan, and Z. Nie, “DAIR-V2X: A large-scale dataset for vehicle-infrastructure cooperative 3D object detection,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21 361–21 370

2022
[38]

Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,

Y . Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” inACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASP- LOS), 2017, pp. 615–629

2017
[39]

BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems,

J. Shao and J. Zhang, “BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems,”IEEE Com- munications Letters, vol. 24, no. 8, pp. 1652–1656, 2020

2020
[40]

JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,

A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,”IEEE Transactions on Mobile Computing, vol. 20, no. 2, pp. 565–576, 2019

2019
[41]

BottleFit: Learning compressed representations in deep neural networks for effective and efficient split computing,

Y . Matsubara, D. Callegaro, S. Singh, M. Levorato, and F. Restuc- cia, “BottleFit: Learning compressed representations in deep neural networks for effective and efficient split computing,” inIEEE Interna- tional Workshop on Machine Learning for Signal Processing (MLSP), 2022, pp. 1–6

2022
[42]

FrankenSplit: Saliency-guided neural feature com- pression with shallow variational bottleneck injection for mobile edge computing,

A. Furutanpey, P. Barco, D. Schinagl, K. Schoeffmann, C. Timmerer, and S. Dustdar, “FrankenSplit: Saliency-guided neural feature com- pression with shallow variational bottleneck injection for mobile edge computing,” inACM Multimedia, 2023, pp. 5424–5432

2023
[43]

Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,

E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,”ACM SIGCOMM Workshop on Mobile Edge Communications (MECOMM), pp. 31–36, 2019

2019
[44]

Edge computing for autonomous driving: Opportunities and challenges,

S. Liu, L. Liu, J. Tang, B. Yu, Y . Wang, and W. Shi, “Edge computing for autonomous driving: Opportunities and challenges,”Proceedings of the IEEE, vol. 107, no. 8, pp. 1697–1716, 2019

2019
[45]

ROI-aware video encoding for bandwidth-efficient cloud inference,

Z. Wang, W. Liu, and H. Li, “ROI-aware video encoding for bandwidth-efficient cloud inference,” inIEEE International Confer- ence on Multimedia and Expo (ICME), 2024, pp. 1–6

2024
[46]

Task-driven semantic coding via reinforcement learning,

X. Zhang, J. Shao, and J. Zhang, “Task-driven semantic coding via reinforcement learning,”IEEE Transactions on Image Processing, vol. 32, pp. 6250–6262, 2023

2023
[47]

Efficient large-scale vision model inference with tiling and caching,

K. Chenet al., “Efficient large-scale vision model inference with tiling and caching,” inAAAI Conference on Artificial Intelligence, 2024, pp. 1–9

2024
[48]

Communication-efficient edge AI: Federated learning and integrated communication and computation,

G. Shiet al., “Communication-efficient edge AI: Federated learning and integrated communication and computation,”IEEE Communica- tions Surveys and Tutorials, vol. 22, no. 4, pp. 2168–2207, 2020

2020
[49]

Improved baselines with visual instruction tuning,

H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26 296–26 306

2024
[50]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Dollar, and R. Gir- shick, “Segment anything,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4015–4026

2023
[51]

SAM 2: Segment anything in images and videos,

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R¨adle, C. Rolland, L. Gustafsonet al., “SAM 2: Segment anything in images and videos,”arXiv preprint arXiv:2408.00714, 2024

Pith/arXiv arXiv 2024

[1] [1]

LIDAR based cooperative sensing in vehicular edge computing,

L. Jiang, K. Liu, C. Liu, H. Ren, G. Yan, F. Jin, and S. Guo, “LIDAR based cooperative sensing in vehicular edge computing,” in2023 19th International Conference on Mobility, Sensing and Networking (MSN). IEEE Computer Society, 2023, pp. 17–23

2023

[2] [2]

Cloud-assisted 360-degree 3D perception for autonomous vehicles using V2X com- munication and hybrid computing,

F. Hawlader, F. Robinet, G. Elghazaly, and R. Frank, “Cloud-assisted 360-degree 3D perception for autonomous vehicles using V2X com- munication and hybrid computing,” in2025 20th Wireless On-Demand Network Systems and Services Conference (WONS). IEEE, 2025, pp. 1–8

2025

[3] [3]

GPT-4o system card,

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “GPT-4o system card,”arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024

[4] [4]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024

2024

[5] [5]

LISA: Reasoning segmentation via large language model,

X. Lai, Z. Tian, Y . Chen, Y . Li, Y . Yuan, S. Liu, and J. Jia, “LISA: Reasoning segmentation via large language model,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9579–9589

2024

[6] [6]

LISA++: An improved baseline for reasoning segmentation with large language model,

S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia, “LISA++: An improved baseline for reasoning segmentation with large language model,”arXiv preprint arXiv:2312.17240, 2023

arXiv 2023

[7] [7]

Applications of large language models and multimodal large models in autonomous driving: A comprehensive review,

J. Li, J. Li, G. Yang, L. Yang, H. Chi, and L. Yang, “Applications of large language models and multimodal large models in autonomous driving: A comprehensive review,”Drones, vol. 9, no. 4, p. 238, 2025

2025

[8] [8]

CoLMDriver: LLM-based negotiation benefits cooperative autonomous driving,

C. Liu, G. Liu, Z. Wang, J. Yang, and S. Chen, “CoLMDriver: LLM-based negotiation benefits cooperative autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 25 951–25 960

2025

[9] [9]

nuScenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

2020

[10] [10]

Bandwidth-efficient communication modelling for autonomous vehicle collaborative perception,

D. Jin, Y . Zeng, and Y . Gong, “Bandwidth-efficient communication modelling for autonomous vehicle collaborative perception,” in2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 6146–6155

2025

[11] [11]

V2X cooperative perception for autonomous driving: Recent advances and challenges,

T. Huang, J. Liu, X. Zhou, D. C. Nguyen, M. R. Azghadi, Y . Xia, Q.-L. Han, and S. Sun, “V2X cooperative perception for autonomous driving: Recent advances and challenges,”arXiv preprint arXiv:2310.03525, 2023

arXiv 2023

[12] [12]

LLaV A-PruMerge: Adaptive token reduction for efficient large multimodal models,

Y . Shang, M. Cai, B. Xu, Y . J. Lee, and Y . Yan, “LLaV A-PruMerge: Adaptive token reduction for efficient large multimodal models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22 857–22 867

2025

[13] [13]

Token merging: Your ViT but faster,

D. Bolya, C.-Y . Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your ViT but faster,”arXiv preprint arXiv:2210.09461, 2022

Pith/arXiv arXiv 2022

[14] [14]

MMInference: Accelerating pre-filling for long-context VLMs via modality-aware permutation sparse attention,

Y . Li, H. Jiang, C. Zhang, Q. Wu, X. Luo, S. Ahn, A. H. Abdi, D. Li, J. Gao, Y . Yanget al., “MMInference: Accelerating pre-filling for long-context VLMs via modality-aware permutation sparse attention,” arXiv preprint arXiv:2504.16083, 2025

arXiv 2025

[15] [15]

DETRs beat YOLOs on real-time object detection,

Y . Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y . Liu, and J. Chen, “DETRs beat YOLOs on real-time object detection,”IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16 965–16 974, 2024

2024

[16] [16]

Open-world hazard detection and captioning for autonomous driving with a unified multimodal pipeline,

M. Hatamiet al., “Open-world hazard detection and captioning for autonomous driving with a unified multimodal pipeline,” inProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

2025

[17] [17]

Openad: Open-world autonomous driving bench- mark for 3d object detection,

M.-H. Yanget al., “Openad: Open-world autonomous driving bench- mark for 3d object detection,” inAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[18] [18]

Mcds-vss: Moving camera dynamic scene video semantic segmentation by filtering with self-supervised geometry and motion,

C. Villaret al., “Mcds-vss: Moving camera dynamic scene video semantic segmentation by filtering with self-supervised geometry and motion,” inBritish Machine Vision Conference (BMVC), 2024

2024

[19] [19]

Enhanced dynamic obstacle avoidance for uavs using event camera and ego-motion compensation,

B. Huanget al., “Enhanced dynamic obstacle avoidance for uavs using event camera and ego-motion compensation,”Drones, vol. 9, no. 11, p. 745, 2025

2025

[20] [20]

Scalability in perception for autonomous driving: Waymo Open Dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caineet al., “Scalability in perception for autonomous driving: Waymo Open Dataset,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2443– 2451

2020

[21] [21]

Are we ready for autonomous driving? The KITTI vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354– 3361

2012

[22] [22]

Canadian adverse driving conditions dataset,

M. Pitropov, D. Garcia, S. Aklanoglu, M. Abdelwahab, M. Smart, A. O’Reilly, C. Lu, W. Schwarting, E. Yurtsever, R. Urtasunet al., “Canadian adverse driving conditions dataset,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 11 565–11 574

2021

[23] [23]

Cooper: Cooperative perception for connected autonomous vehicles based on 3D point clouds,

Q. Chen, X. Ma, S. Tang, J. Guo, Q. Yang, and S. Fu, “Cooper: Cooperative perception for connected autonomous vehicles based on 3D point clouds,”IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 514–524, 2019

2019

[24] [24]

V2VNet: Vehicle-to-vehicle communication for joint perception and prediction,

T.-H. Wang, S. Manivasagam, M. Liang, B. Yang, W. Zeng, and R. Urtasun, “V2VNet: Vehicle-to-vehicle communication for joint perception and prediction,” inEuropean Conference on Computer Vision (ECCV), 2020, pp. 605–621

2020

[25] [25]

Learning distilled collaboration graph for multi-agent perception,

Y . Li, S. Ren, P. Wu, S. Chen, C. Feng, and W. Zhang, “Learning distilled collaboration graph for multi-agent perception,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 29 541–29 552

2021

[26] [26]

When2com: Multi- agent perception via communication graph grouping,

Y .-C. Liu, J. Tian, N. Glaser, and Z. Kira, “When2com: Multi- agent perception via communication graph grouping,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4106–4115

2020

[27] [27]

Who2com: Collaborative perception via learnable handshake com- munication,

Y .-C. Liu, J. Tian, C.-Y . Ma, N. Glaser, C.-W. Kuo, and Z. Kira, “Who2com: Collaborative perception via learnable handshake com- munication,” inIEEE International Conference on Robotics and Au- tomation (ICRA), 2020, pp. 6876–6883

2020

[28] [28]

Where2comm: Communication-efficient collaborative perception via spatial confi- dence maps,

Y . Hu, S. Fang, Z. Lei, Y . Zhong, and S. Chen, “Where2comm: Communication-efficient collaborative perception via spatial confi- dence maps,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 4874–4886

2022

[29] [29]

A novel communication-efficient cooperative percep- tion framework based on infrastructure-side critical feature extraction,

A. Authoret al., “A novel communication-efficient cooperative percep- tion framework based on infrastructure-side critical feature extraction,” IEEE Transactions on Intelligent Transportation Systems, 2024

2024

[30] [30]

Supply-demand-driven information selection algo- rithm for efficient V2X perception,

B. Authoret al., “Supply-demand-driven information selection algo- rithm for efficient V2X perception,” inProc. IEEE Int. Conf., 2024

2024

[31] [31]

Lifetime-guaranteed cost-minimized heterogeneous visual sensor networks with feature collection for effective target coverage in 3-D space,

C. Authoret al., “Lifetime-guaranteed cost-minimized heterogeneous visual sensor networks with feature collection for effective target coverage in 3-D space,”IEEE Internet of Things Journal, 2023

2023

[32] [32]

V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer,

R. Xu, H. Xiang, Z. Tu, X. Xia, M.-H. Yang, and J. Ma, “V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer,” inEuropean Conference on Computer Vision (ECCV), 2022, pp. 107– 124

2022

[33] [33]

Among us: Adversarially robust collaborative perception by consensus,

Y . Li, Q. Ren, L. Che, S. Chen, C. Feng, and W. Zhang, “Among us: Adversarially robust collaborative perception by consensus,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 186–195

2023

[34] [34]

Collaborative perception in autonomous driving: Methods, datasets, and challenges,

Y . Han, H. Zhang, H. Li, Y . Jin, C. Lang, and Y . Li, “Collaborative perception in autonomous driving: Methods, datasets, and challenges,” IEEE Intelligent Transportation Systems Magazine, vol. 15, no. 6, pp. 131–151, 2023

2023

[35] [35]

OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,

R. Xu, H. Xiang, X. Xia, X. Han, J. Li, and J. Ma, “OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,” inIEEE International Conference on Robotics and Automation (ICRA), 2022, pp. 2583–2589

2022

[36] [36]

V2X- Sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving,

Y . Li, D. Ma, Z. An, Z. Wang, Y . Zhong, S. Chen, and C. Feng, “V2X- Sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving,” inIEEE Robotics and Automation Letters, vol. 7, no. 4, 2022, pp. 10 914–10 921

2022

[37] [37]

DAIR-V2X: A large-scale dataset for vehicle-infrastructure cooperative 3D object detection,

H. Yu, Y . Luo, M. Shu, Y . Huo, Z. Yang, Y . Shi, Z. Guo, H. Li, X. Hu, J. Yuan, and Z. Nie, “DAIR-V2X: A large-scale dataset for vehicle-infrastructure cooperative 3D object detection,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21 361–21 370

2022

[38] [38]

Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,

Y . Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” inACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASP- LOS), 2017, pp. 615–629

2017

[39] [39]

BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems,

J. Shao and J. Zhang, “BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems,”IEEE Com- munications Letters, vol. 24, no. 8, pp. 1652–1656, 2020

2020

[40] [40]

JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,

A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,”IEEE Transactions on Mobile Computing, vol. 20, no. 2, pp. 565–576, 2019

2019

[41] [41]

BottleFit: Learning compressed representations in deep neural networks for effective and efficient split computing,

Y . Matsubara, D. Callegaro, S. Singh, M. Levorato, and F. Restuc- cia, “BottleFit: Learning compressed representations in deep neural networks for effective and efficient split computing,” inIEEE Interna- tional Workshop on Machine Learning for Signal Processing (MLSP), 2022, pp. 1–6

2022

[42] [42]

FrankenSplit: Saliency-guided neural feature com- pression with shallow variational bottleneck injection for mobile edge computing,

A. Furutanpey, P. Barco, D. Schinagl, K. Schoeffmann, C. Timmerer, and S. Dustdar, “FrankenSplit: Saliency-guided neural feature com- pression with shallow variational bottleneck injection for mobile edge computing,” inACM Multimedia, 2023, pp. 5424–5432

2023

[43] [43]

Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,

E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,”ACM SIGCOMM Workshop on Mobile Edge Communications (MECOMM), pp. 31–36, 2019

2019

[44] [44]

Edge computing for autonomous driving: Opportunities and challenges,

S. Liu, L. Liu, J. Tang, B. Yu, Y . Wang, and W. Shi, “Edge computing for autonomous driving: Opportunities and challenges,”Proceedings of the IEEE, vol. 107, no. 8, pp. 1697–1716, 2019

2019

[45] [45]

ROI-aware video encoding for bandwidth-efficient cloud inference,

Z. Wang, W. Liu, and H. Li, “ROI-aware video encoding for bandwidth-efficient cloud inference,” inIEEE International Confer- ence on Multimedia and Expo (ICME), 2024, pp. 1–6

2024

[46] [46]

Task-driven semantic coding via reinforcement learning,

X. Zhang, J. Shao, and J. Zhang, “Task-driven semantic coding via reinforcement learning,”IEEE Transactions on Image Processing, vol. 32, pp. 6250–6262, 2023

2023

[47] [47]

Efficient large-scale vision model inference with tiling and caching,

K. Chenet al., “Efficient large-scale vision model inference with tiling and caching,” inAAAI Conference on Artificial Intelligence, 2024, pp. 1–9

2024

[48] [48]

Communication-efficient edge AI: Federated learning and integrated communication and computation,

G. Shiet al., “Communication-efficient edge AI: Federated learning and integrated communication and computation,”IEEE Communica- tions Surveys and Tutorials, vol. 22, no. 4, pp. 2168–2207, 2020

2020

[49] [49]

Improved baselines with visual instruction tuning,

H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26 296–26 306

2024

[50] [50]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Dollar, and R. Gir- shick, “Segment anything,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4015–4026

2023

[51] [51]

SAM 2: Segment anything in images and videos,

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R¨adle, C. Rolland, L. Gustafsonet al., “SAM 2: Segment anything in images and videos,”arXiv preprint arXiv:2408.00714, 2024

Pith/arXiv arXiv 2024