Tiny Collaborative Inference for Occlusion-Robust Object Detection

Chieh-Tung Cheng; Eiman Kanjo; Mustafa Aslanov

arxiv: 2606.02894 · v2 · pith:IRDPBDX4new · submitted 2026-06-01 · 💻 cs.CV

Pith reviewed 2026-06-28 14:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords occlusion-robust object detection · collaborative inference · weighted boxes fusion · edge AI · MCUNet · YOLOv2 · multi-view fusion · tiny hardware

The pith

Decision-level fusion with weighted boxes outperforms feature fusion for occlusion-robust detection on devices under 1 MB SRAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests ways to combine detections from multiple tiny cameras so that partially hidden objects are still found reliably on hardware too small for heavy models. It compares early fusion of internal features against late fusion of final bounding boxes using weighted boxes fusion, and shows the late method gives higher accuracy in every occlusion test while using little extra data transfer. The work demonstrates this can run directly on the devices themselves without a central host, producing more frames with detections than a single unit alone. A sympathetic reader would care because search-and-rescue sensors often operate in cluttered environments where one view is blocked and communication must stay minimal.

Core claim

The central claim is that decision-level fusion via Weighted Boxes Fusion outperforms feature-level fusion under all tested occlusion conditions on MCUNet-YOLOv2 models quantized for less than 1 MB SRAM. Gains reach 0.2736 mAP in two-view asymmetric cases and 0.3827 mAP when extended to three views, at roughly 1.3 KB of communication per exchange. The fusion also executes on-device on Coral Dev Board Micro units, with communication energy reported as negligible relative to inference, and raises autonomous coverage from 47 to 61 of 108 frames in a 301.9-second session.

What carries the argument

Weighted Boxes Fusion (WBF), which merges bounding-box outputs from separate detectors by weighting them according to their confidence scores.
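As an illustration of the weighting step described above, here is a minimal single-class sketch of Weighted Boxes Fusion in Python. It follows the general procedure published by Solovyev et al., not the paper's own on-device code. The function names, the `(x1, y1, x2, y2, confidence)` tuple layout, and the 0.55 IoU threshold are assumptions chosen for the example, and the sketch assumes every view's boxes have already been mapped into one shared coordinate frame.

```python
from typing import List, Tuple

# Assumed layout: (x1, y1, x2, y2, confidence) for a single object class.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster: List[Box]) -> Box:
    """Confidence-weighted mean of coordinates; plain mean of confidences."""
    total = sum(b[4] for b in cluster)
    x1, y1, x2, y2 = (sum(b[i] * b[4] for b in cluster) / total for i in range(4))
    return (x1, y1, x2, y2, total / len(cluster))


def weighted_boxes_fusion(views: List[List[Box]], iou_thr: float = 0.55) -> List[Box]:
    """Fuse per-view detections (hypothetical helper, not the paper's code)."""
    # Process boxes from all views in order of decreasing confidence.
    boxes = sorted((b for v in views for b in v if b[4] > 0),
                   key=lambda b: b[4], reverse=True)
    clusters: List[List[Box]] = []
    fused: List[Box] = []
    for box in boxes:
        # Find the existing fused box that overlaps this one the most.
        best_k, best_iou = -1, iou_thr
        for k, f in enumerate(fused):
            overlap = iou(box, f)
            if overlap > best_iou:
                best_k, best_iou = k, overlap
        if best_k < 0:
            clusters.append([box])
            fused.append(box)
        else:
            clusters[best_k].append(box)
            fused[best_k] = fuse_cluster(clusters[best_k])
    # Scale confidence down when fewer views than available support a box.
    n_views = len(views)
    return [f[:4] + (f[4] * min(len(c), n_views) / n_views,)
            for f, c in zip(fused, clusters)]
```

In this sketch a box seen by only one view keeps its coordinates but has its confidence scaled down by the number of views. Whether the paper keeps that rescaling for heavily occluded views is not stated in this summary.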

If this is right

Three-view fusion adds further accuracy at only modest extra communication cost.
On-device WBF removes the need for a host computer and raises the fraction of frames that contain detections.
The same fusion step works across both USB-relay and Wi-Fi peer-to-peer setups.
Federated learning remains possible but shows limited gains when data across nodes are non-iid.
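To make the communication cost in the points above more concrete, the sketch below packs a list of detections into a compact binary payload with Python's standard `struct` module. The wire format, field widths, and the names `encode` and `decode` are hypothetical. This summary gives only the approximate payload size (~1.3 KB per exchange), not the paper's actual encoding.

```python
import struct
from typing import List, Tuple

# Assumed record: class id, confidence, and box corners normalised to [0, 1].
Detection = Tuple[int, float, float, float, float, float]

HEADER = struct.Struct("<HH")      # frame id, detection count (4 bytes)
RECORD = struct.Struct("<BBHHHH")  # class, confidence, x1, y1, x2, y2 (10 bytes)


def _quantise(value: float, scale: int) -> int:
    """Map a value in [0, 1] to an integer in [0, scale], clamping outliers."""
    return max(0, min(scale, round(value * scale)))


def encode(frame_id: int, detections: List[Detection]) -> bytes:
    """Serialise detections into a little-endian payload without padding."""
    out = bytearray(HEADER.pack(frame_id, len(detections)))
    for cls, conf, x1, y1, x2, y2 in detections:
        coords = (_quantise(v, 65535) for v in (x1, y1, x2, y2))
        out += RECORD.pack(cls, _quantise(conf, 255), *coords)
    return bytes(out)


def decode(payload: bytes) -> Tuple[int, List[Detection]]:
    """Inverse of encode(); values come back with small quantisation error."""
    frame_id, count = HEADER.unpack_from(payload, 0)
    detections: List[Detection] = []
    for i in range(count):
        offset = HEADER.size + i * RECORD.size
        cls, conf, x1, y1, x2, y2 = RECORD.unpack_from(payload, offset)
        detections.append(
            (cls, conf / 255, x1 / 65535, y1 / 65535, x2 / 65535, y2 / 65535)
        )
    return frame_id, detections
```

Under this assumed layout each box costs 10 bytes, so the payload grows linearly with the number of detections and stays independent of image or feature-map resolution, which is the property that keeps decision-level exchanges small.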

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar late-fusion logic could apply to other bandwidth-limited multi-sensor tasks such as distributed tracking.
Energy measurements on longer missions would show whether the coverage gain translates into extended battery runtime.
Replacing the current backbone with newer tiny detectors might lower the baseline memory requirement even further.

Load-bearing premise

The accuracy and energy results obtained with the specific occlusion patterns and datasets used are representative of real search-and-rescue conditions on these boards.

What would settle it

A field deployment in actual search-and-rescue terrain where the measured mAP improvement from two- or three-view WBF falls below 0.1 and coverage gain disappears.

Figures

Figures reproduced from arXiv: 2606.02894 by Chieh-Tung Cheng, Eiman Kanjo, Mustafa Aslanov.

**Figure 2.** Illustration of common network topologies used in DFL

**Figure 3.** Overview of the system architecture: (Left) model pre-training with MCUNet backbone and YOLOv2 head; (Middle) collabora...

**Figure 4.** Backbone structure of MCUNet

**Figure 5.** YOLOv2 detection head used in this work

**Figure 6.** Multi-view feature-level fusion pipeline

**Figure 7.** Multi-view decision-level fusion pipeline

**Figure 8.** Loss curve for fine-tuning MCUNet-YOLOv2 with ...

**Figure 9.** Trade-off between accuracy and FLOPs under varying input resolutions.

**Figure 10.** Backbone comparison: MCUNet, MobileNetV2, and ResNet-18.

**Figure 11.** Feature-level vs. decision-level fusion accuracy across occlusion pairs.

**Figure 12.** WBF (red) and baseline (blue) precision–recall comparison (view 1).

**Figure 13.** WBF (red) and baseline (blue) precision–recall comparison (view 2).

**Figure 14.** Example of confidence averaging in WBF.

**Figure 15.** Two-view versus three-view fusion accuracy across occlusion triplets.

**Figure 16.** Accuracy versus communication cost for two-view and three-view fusion.

**Figure 17.** Training loss curve of FedAvg over successive rounds.

Original abstract

Edge AI nodes for search and rescue are increasingly expected to run computer vision locally, yet ultra-low-end hardware imposes hard constraints on memory, compute, and inter-device communication. This work addresses occlusion-robust object detection on devices with less than 1 MB SRAM by combining an MCUNet backbone, a YOLOv2 detection head, and Lite quantisation. Two collaborative inference strategies are evaluated: feature-level fusion, concatenating intermediate feature maps, and decision-level fusion via Weighted Boxes Fusion (WBF). WBF outperforms feature-level fusion under all tested occlusion conditions, yielding gains of up to +0.2736 mAP in asymmetric scenarios. Extending fusion to three views improves accuracy further (up to +0.3827 mAP) at modest communication overhead (~1.3 KB per exchange). Hardware experiments progress from a host-assisted USB-relay baseline to a Wi-Fi peer-to-peer deployment on two Coral Dev Board Micro units, where WBF executes on-device with negligible communication energy relative to inference. In a 301.9 s autonomous session of 108 frames, fused output is produced on 61 frames versus 47 for a single board - a coverage gain of +29.8%. A decentralised federated learning feasibility note is included but not treated as a primary result, as performance remains limited under non-iid data. The results support decision-level fusion as a viable option for improving occlusion robustness in small-scale edge object detection, including host-free multi-board operation on ultra-low-end hardware.
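The coverage figure in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the relative gain from the reported frame counts. The per-board coverage shares and the mean time per frame are derived here for context and do not appear in the abstract, and the mean assumes the 108 frames span the whole 301.9 s session.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Values taken from the abstract.
TOTAL_FRAMES = 108
SINGLE_BOARD_FRAMES = 47
FUSED_FRAMES = 61
SESSION_SECONDS = 301.9

gain = relative_gain(SINGLE_BOARD_FRAMES, FUSED_FRAMES)
single_share = SINGLE_BOARD_FRAMES / TOTAL_FRAMES
fused_share = FUSED_FRAMES / TOTAL_FRAMES
seconds_per_frame = SESSION_SECONDS / TOTAL_FRAMES

print(f"relative coverage gain: {gain:.1%}")            # 29.8%
print(f"single-board coverage:  {single_share:.1%}")    # 43.5%
print(f"fused coverage:         {fused_share:.1%}")     # 56.5%
print(f"mean time per frame:    {seconds_per_frame:.2f} s")  # 2.80 s
```

The +29.8% is therefore a gain relative to the single-board count (14 extra frames over 47), not a 29.8 percentage-point rise in the share of frames with output.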

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

WBF on MCUNet+YOLOv2 gives measurable mAP and coverage gains on <1MB SRAM Coral boards, but the negligible-energy claim for Wi-Fi exchanges rests on assertion rather than measured mJ.

The main takeaway is that decision-level fusion via WBF beats feature concatenation for occlusion robustness in this exact low-SRAM multi-device setup, with reported gains up to +0.27 mAP and a +30% coverage lift in a 108-frame session. They actually ran the fusion on two Coral Dev Board Micro units over Wi-Fi P2P after starting from a USB baseline.

What stands out is the concrete hardware progression and the three-view numbers. They kept everything under 1 MB SRAM with Lite quantization, measured on-device execution, and showed that adding a third view improves results further at a communication overhead of only ~1.3 KB per exchange. Judging from the abstract alone, that combination on real constrained boards does not appear to be in the existing literature.

The soft spot is the energy claim. The paper states communication energy is negligible relative to inference for the 301.9 s run, yet supplies no power traces or mJ-per-exchange figures. Without those, the practicality argument for host-free operation stays partly unverified. Dataset details, exact occlusion synthesis, and statistical significance of the mAP deltas are also thin in the provided text, which limits how far the gains can be generalized to other search-and-rescue scenes.

This is for people building multi-device edge vision systems under tight memory limits, not for core algorithm theorists. The empirical work is honest and the hardware validation is real, so it clears the bar for peer review even if the energy section needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper evaluates two collaborative inference strategies—feature-level fusion and decision-level fusion via Weighted Boxes Fusion (WBF)—for occlusion-robust object detection on ultra-low-end edge devices (<1 MB SRAM) using an MCUNet backbone with YOLOv2 head and Lite quantization. It reports that WBF outperforms feature-level fusion under tested occlusion conditions (gains up to +0.2736 mAP in asymmetric cases), with further gains from three-view fusion (+0.3827 mAP) at ~1.3 KB overhead per exchange. Hardware experiments on Coral Dev Board Micro units progress from USB-relay to Wi-Fi P2P, claiming on-device WBF execution with negligible communication energy and a +29.8% coverage gain (61 vs. 47 frames) in a 301.9 s / 108-frame autonomous session. A brief federated learning note is included but not central.

Significance. If the hardware claims hold, the work supplies concrete empirical evidence that decision-level fusion can improve occlusion robustness and coverage in multi-view setups on severely memory-constrained devices, with modest communication cost. The reported mAP deltas and coverage numbers from real hardware runs are a strength; the approach could be relevant for search-and-rescue edge AI if energy and dataset details are clarified.

major comments (1)

[Abstract / hardware experiments] Abstract and hardware experiments section: the central claim that WBF on-device execution incurs 'negligible communication energy relative to inference' in the Wi-Fi P2P Coral Dev Board Micro deployment is unsupported by any quantitative measurements (e.g., mJ per inference vs. per 1.3 KB exchange, or power traces). This directly weakens the hardware-practicality half of the contribution for <1 MB SRAM devices.

minor comments (1)

[Abstract / results] The abstract and results mention specific datasets and occlusion generation methods but do not provide sufficient detail on exact occlusion synthesis procedure or statistical significance tests for the mAP deltas.
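One way the significance request in this comment could be addressed is a paired percentile bootstrap over per-scene accuracy differences. The sketch below is only an illustration of that idea: the function name, the choice of per-scene AP deltas as the resampled unit, and the example values are all assumptions, and none of the numbers come from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_mean_ci(deltas: List[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean paired delta."""
    if not deltas:
        raise ValueError("need at least one paired delta")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-scene AP differences (fused minus single view).
# These are made-up illustration values, NOT results from the paper.
example_deltas = [0.10, 0.25, 0.05, 0.30, 0.15, 0.20, 0.00, 0.35]
low, high = bootstrap_mean_ci(example_deltas)
print(f"95% CI for mean delta: [{low:.3f}, {high:.3f}]")
```

If the resulting interval excludes zero, the reported improvement is unlikely to be an artefact of which scenes happened to be sampled. Because mAP is not a simple mean over scenes, a stricter version would recompute mAP on each resampled set rather than averaging deltas.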

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the hardware energy claim. We address it directly below.

Referee: [Abstract / hardware experiments] Abstract and hardware experiments section: the central claim that WBF on-device execution incurs 'negligible communication energy relative to inference' in the Wi-Fi P2P Coral Dev Board Micro deployment is unsupported by any quantitative measurements (e.g., mJ per inference vs. per 1.3 KB exchange, or power traces). This directly weakens the hardware-practicality half of the contribution for <1 MB SRAM devices.

Authors: We agree that the manuscript provides no direct quantitative energy measurements (mJ, power traces, or per-inference vs. per-exchange comparisons) to substantiate the 'negligible communication energy' phrasing. The statement was based on the modest payload size (~1.3 KB) and the known high compute cost of MCUNet inference on the target platform, but this remains a qualitative inference rather than an empirically measured result. In the revised version we will (1) remove or qualify the unqualified 'negligible' wording in both the abstract and hardware-experiments section, (2) explicitly note the absence of direct energy profiling, and (3) add a short discussion of the data-size argument together with a reference to typical Wi-Fi energy costs on similar Cortex-M devices. These changes will make the hardware-practicality claims more precise while preserving the reported coverage and latency numbers.

revision: yes
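The per-inference versus per-exchange comparison discussed in this exchange comes down to integrating sampled power over time. The sketch below shows that calculation with the trapezoidal rule. The function name and the two example traces are placeholders invented for illustration and are not measurements from the paper or from the Coral Dev Board Micro.

```python
from typing import Sequence


def energy_mj(times_s: Sequence[float], power_mw: Sequence[float]) -> float:
    """Integrate a sampled power trace with the trapezoidal rule (mW x s = mJ)."""
    if len(times_s) != len(power_mw) or len(times_s) < 2:
        raise ValueError("need two or more matching time/power samples")
    return sum(
        0.5 * (power_mw[i] + power_mw[i + 1]) * (times_s[i + 1] - times_s[i])
        for i in range(len(times_s) - 1)
    )


# Placeholder traces for illustration only; these are NOT measured values.
inference_e = energy_mj([0.0, 0.5, 1.0], [400.0, 400.0, 400.0])  # flat 400 mW for 1 s
exchange_e = energy_mj([0.0, 0.1], [200.0, 200.0])               # flat 200 mW for 0.1 s
print(f"exchange / inference energy ratio: {exchange_e / inference_e:.3f}")
```

Reporting a measured ratio of this kind, with real traces captured during inference and during a ~1.3 KB exchange, would turn the qualitative payload-size argument into a quantitative one.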

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions

The paper evaluates two fusion strategies (feature-level vs. WBF decision-level) via direct mAP measurements on occlusion datasets and hardware timing/energy on Coral Dev Board Micro units. No equations, fitted parameters, or derivation chains are present; results are reported as measured outputs (e.g., +0.2736 mAP, 1.3 KB overhead, 301.9 s session coverage). The reader's assessment of score 1.0 is consistent with the absence of any load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an empirical comparison of standard techniques on constrained hardware.


[1] [1]

The Internet of Things: Catching Up to an Accelerating Opportunity

Chui M, Collins M, and Patel M. The Internet of Things: Catching Up to an Accelerating Opportunity. 2021. Accessed: 2025-08-24. Available from: https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/iot-value-set-to-accelerate-through-2030-where-and-how-to-capture-it

2021

[2] [2]

Deep learning with edge computing: A review

Chen J and Ran X. Deep learning with edge computing: A review. Proceedings of the IEEE 2019; 107:1655–74. doi: 10.1109/JPROC.2019.2921977

work page doi:10.1109/jproc.2019.2921977 2019

[3] [3]

Deep learning for edge computing applications: A state-of-the-art survey

Wang F, Zhang M, Wang X, Ma X, and Liu J. Deep learning for edge computing applications: A state-of-the-art survey. IEEE Access 2020; 8:58322–36. doi: 10.1109/ACCESS.2020.2982411

work page doi:10.1109/access.2020.2982411 2020

[4] [4]

A survey of methods for low-power deep learning and computer vision

Goel A, Tung C, Lu YH, and Thiruvathukal GK. A survey of methods for low-power deep learning and computer vision. In: 2020 IEEE 6th World Forum on Internet of Things (WF-IoT). IEEE; 2020:1–6

2020

[5] [5]

Making accurate object detection at the edge: review and new approach

Huang Z, Yang S, Zhou M, Gong Z, Abusorrah A, Lin C, and Huang Z. Making accurate object detection at the edge: review and new approach. Artificial Intelligence Review 2022; 55:2245–74. doi: 10.1007/s10462-021-10059-3

work page doi:10.1007/s10462-021-10059-3 2022

[6] [6]

Human detection from unmanned aerial vehicles’ images for search and rescue missions: A state-of-the-art review

Bany Abdelnabi AA and Rabadi G. Human detection from unmanned aerial vehicles’ images for search and rescue missions: A state-of-the-art review. IEEE Access 2024; 12:152009–35. doi: 10.1109/ACCESS.2024.3479988

work page doi:10.1109/access.2024.3479988 2024

[7] [7]

UAV-based real-time survivor detection system in post-disaster search and rescue operations

Dong J, Ota K, and Dong M. UAV-based real-time survivor detection system in post-disaster search and rescue operations. IEEE Journal on Miniaturization for Air and Space Systems 2021; 2:209–19. doi: 10.1109/JMASS.2021.3083659

work page doi:10.1109/jmass.2021.3083659 2021

[8] [8]

A review of occluded objects detection in real complex scenarios for autonomous driving

Ruan J, Cui H, Huang Y, Li T, Wu C, and Zhang K. A review of occluded objects detection in real complex scenarios for autonomous driving. Green Energy and Intelligent Transportation 2023; 2:100092. doi: 10.1016/j.geits.2023.100092

work page doi:10.1016/j.geits.2023.100092 2023

[9] [9]

Lightweight deep learning for resource-constrained environments: A survey

Liu HI, Galindo M, Xie H, Wong LK, Shuai HH, Li YH, and Cheng WH. Lightweight deep learning for resource-constrained environments: A survey. ACM Computing Surveys 2024; 56(10):Article 267. doi: 10.1145/3657282

work page doi:10.1145/3657282 2024

[10] [10]

MCUNet: Tiny Deep Learning on IoT Devices

Lin J, Chen WM, Lin Y, Cohn J, Gan C, and Han S. MCUNet: Tiny Deep Learning on IoT Devices. arXiv 2020. arXiv:2007.10319 [cs.CV]. Available from: https://arxiv.org/abs/2007.10319

arXiv 2020

[11] [11]

MCUNetV2: Memory-Efficient Patch-Based Inference for Tiny Deep Learning

Lin J, Chen WM, Cai H, Gan C, and Han S. MCUNetV2: Memory-Efficient Patch-Based Inference for Tiny Deep Learning. arXiv 2021. arXiv:2110.15352 [cs.CV]. Available from: https://arxiv.org/abs/2110.15352

arXiv 2021

[12] [12]

Robustness of object recognition under extreme occlusion in humans and computational models

Zhu H, Tang P, Park J, Park S, and Yuille A. Robustness of object recognition under extreme occlusion in humans and computational models. arXiv

[13] [13]

Available from: https://arxiv.org/abs/1905.04598

arXiv:1905.04598 [cs.CV]. Available from: https://arxiv.org/abs/1905.04598

Pith/arXiv arXiv 1905

[14] [14]

Occlusion handling in generic object detection: A review

Saleh K, Szénási S, and Vámossy Z. Occlusion handling in generic object detection: A review. In: 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI). IEEE; 2021:477–84

2021

[15] [15]

Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion

Kortylewski A, He J, Liu Q, and Yuille AL. Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:8940–49

2020

[16] [16]

Robust object detection under occlusion with context-aware CompositionalNets

Wang A, Sun Y, Kortylewski A, and Yuille AL. Robust object detection under occlusion with context-aware CompositionalNets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020:12645–54. Available from: https://openaccess.thecvf.com/content_ CVPR_2020/html/Wang_Robust_Object_Detection_Under_Occlusion_With_Conte...

2020

[17] [17]

DeepVoting: A robust and explainable deep network for semantic part detection under partial occlusion

Zhang Z, Xie C, Wang J, Xie L, and Yuille AL. DeepVoting: A robust and explainable deep network for semantic part detection under partial occlusion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018:1372–80

2018

[18] Balamurugan D, Aravinth SS, Reddy PCS, Rupani A, and Manikandan A. Multiview objects recognition using deep learning-based Wrap-CNN with voting scheme. Neural Processing Letters 2022; 54(3):1495–521. doi: 10.1007/s11063-021-10679-4

[19] Palena M, Cerquitelli T, and Chiasserini CF. Edge-device collaborative computing for multi-view classification. Computer Networks 2024; 254:110823. doi: 10.1016/j.comnet.2024.110823

[20] Baltrušaitis T, Ahuja C, and Morency LP. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019; 41(2):423–43. doi: 10.1109/TPAMI.2018.2798607

[21] Atrey PK, Hossain MA, El Saddik A, and Kankanhalli MS. Multimodal fusion for multimedia analysis: A survey. Multimedia Systems 2010; 16:345–79

[22] Boulahia SY, Amamra A, Madi MR, and Daikh S. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 2021; 32(6):121. doi: 10.1007/s00138-021-01249-8

[23] Tang C, Ling Y, Yang X, Jin W, and Zheng C. Multi-view object detection based on deep learning. Applied Sciences 2018; 8(9):1423. doi: 10.3390/app8091423

[24] Su S, Li B, Zhang C, Yang M, and Xue X. Cross-Domain Federated Object Detection. In: 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE; 2023:1469–74. doi: 10.1109/ICME55011.2023.00254. Available from: http://dx.doi.org/10.1109/ICME55011.2023.00254

[25] Solovyev R, Wang W, and Gabruseva T. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing 2021; 107:104117. doi: 10.1016/j.imavis.2021.104117. Available from: http://dx.doi.org/10.1016/j.imavis.2021.104117

[26] Alzahrani M, Usman M, Jarraya SK, Anwar S, and Helmy T. Deep models for multi-view 3D object recognition: A review. Artificial Intelligence Review 2024; 57(12):Article 323. doi: 10.1007/s10462-024-10941-w

[27] Lalitha A, Shekhar S, Javidi T, and Koushanfar F. Fully decentralized federated learning. In: Third Workshop on Bayesian Deep Learning (NeurIPS). Vol. 12. 2018

[28] Yuan L, Wang Z, Sun L, Yu PS, and Brinton CG. Decentralized Federated Learning: A Survey and Perspective. IEEE Internet of Things Journal 2024; 11:34617–38. doi: 10.1109/JIOT.2024.3407584

[29] Boyd S, Ghosh A, Prabhakar B, and Shah D. Randomized gossip algorithms. IEEE Transactions on Information Theory 2006; 52:2508–30

[30] Himeur Y, Varlamis I, Kheddar H, Amira A, Atalla S, Singh Y, Bensaali F, and Mansoor W. Federated learning for computer vision. arXiv 2023. arXiv:2308.13558 [cs.CV]. Available from: https://arxiv.org/abs/2308.13558

[31] Redmon J and Farhadi A. YOLO9000: Better, Faster, Stronger. arXiv 2016. arXiv:1612.08242 [cs.CV]. Available from: https://arxiv.org/abs/1612.08242

[32] Zhang Q and Kanjo E. A multicore and Edge TPU-accelerated multimodal TinyML system for livestock behavior recognition. IEEE Internet of Things Journal 2026; 13(1):666–77. doi: 10.1109/JIOT.2025.3624811. Available from: https://arxiv.org/abs/2504.11467

[33] Reizenstein J, Shapovalov R, Henzler P, Sbordone L, Labatut P, and Novotny D. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. arXiv 2021. arXiv:2109.00512 [cs.CV]. Available from: https://arxiv.org/abs/2109.00512

[34] DeVries T and Taylor GW. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017. arXiv:1708.04552 [cs.CV]. Available from: https://arxiv.org/abs/1708.04552

[35] McMahan B, Moore E, Ramage D, Hampson S, and Agüera y Arcas B. Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR; 2017:1273–82

[36] Gao SH, Han Q, Li D, Cheng MM, and Peng P. Representative Batch Normalization with Feature Calibration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021:8669–79

[37] A Standard for the Transmission of IP Datagrams over Ethernet Networks. RFC 894. 1984 Apr. doi: 10.17487/RFC0894. Available from: https://www.rfc-editor.org/info/rfc894

[38] Coral. Dev Board Micro datasheet. Version 1.0. Google LLC. Available from: https://coral.ai/static/files/Coral-Dev-Board-Micro-datasheet.pdf