
arxiv: 2606.02894 · v2 · pith:IRDPBDX4new · submitted 2026-06-01 · 💻 cs.CV

Tiny Collaborative Inference for Occlusion-Robust Object Detection

Pith reviewed 2026-06-28 14:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords occlusion-robust object detection · collaborative inference · weighted boxes fusion · edge AI · MCUNet · YOLOv2 · multi-view fusion · tiny hardware

The pith

Decision-level fusion with weighted boxes outperforms feature fusion for occlusion-robust detection on devices under 1 MB SRAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests ways to combine detections from multiple tiny cameras so that partially hidden objects are still found reliably on hardware too small for heavy models. It compares early fusion of internal features against late fusion of final bounding boxes using weighted boxes fusion, and shows the late method gives higher accuracy in every occlusion test while using little extra data transfer. The work demonstrates this can run directly on the devices themselves without a central host, producing more frames with detections than a single unit alone. A sympathetic reader would care because search-and-rescue sensors often operate in cluttered environments where one view is blocked and communication must stay minimal.

Core claim

The central claim has two parts. First, decision-level fusion via Weighted Boxes Fusion outperforms feature-level fusion under all tested occlusion conditions on MCUNet-YOLOv2 models quantized to fit in less than 1 MB of SRAM, with gains reaching 0.2736 mAP in two-view asymmetric cases and 0.3827 mAP with three views, at roughly 1.3 KB of communication per exchange. Second, the fusion executes on-device on Coral Dev Board Micro units with communication energy described as negligible relative to inference, and it raises the number of frames with output from 47 (single board) to 61 (fused) in a 301.9-second autonomous session.

What carries the argument

Weighted Boxes Fusion (WBF), which merges bounding-box outputs from separate detectors by weighting them according to their confidence scores.
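
To make the weighting concrete, here is a minimal single-class sketch of WBF in Python. It is an illustration in the spirit of Solovyev et al., not the authors' on-device implementation. The function names, the IoU threshold of 0.55, and the assumption that boxes from every view are already expressed in one shared image coordinate frame are ours.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, confidence); single class, shared coordinate frame (assumption)
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster: List[Box], n_views: int) -> Box:
    """Confidence-weighted mean of the coordinates of one cluster.

    The fused score is the mean confidence, scaled down when fewer
    views than n_views contributed a box to the cluster.
    """
    total = sum(b[4] for b in cluster)
    coords = [sum(b[i] * b[4] for b in cluster) / total for i in range(4)]
    score = (total / len(cluster)) * min(len(cluster), n_views) / n_views
    return (coords[0], coords[1], coords[2], coords[3], score)


def weighted_boxes_fusion(views: List[List[Box]], iou_thr: float = 0.55) -> List[Box]:
    """Fuse per-view detections into one list of boxes."""
    n_views = len(views)
    # Pool all boxes with positive confidence, highest confidence first.
    boxes = sorted(
        (b for v in views for b in v if b[4] > 0),
        key=lambda b: b[4],
        reverse=True,
    )
    clusters: List[List[Box]] = []
    fused: List[Box] = []
    for box in boxes:
        # Find the fused box that overlaps this one the most, above the threshold.
        best, best_iou = -1, iou_thr
        for i, f in enumerate(fused):
            overlap = iou(f, box)
            if overlap > best_iou:
                best, best_iou = i, overlap
        if best >= 0:
            clusters[best].append(box)
            fused[best] = fuse_cluster(clusters[best], n_views)
        else:
            clusters.append([box])
            fused.append(fuse_cluster([box], n_views))
    return fused


if __name__ == "__main__":
    view_a = [(10, 10, 50, 50, 0.9), (100, 100, 120, 120, 0.8)]
    view_b = [(12, 12, 52, 52, 0.6)]  # second view sees only the first object
    for fused_box in weighted_boxes_fusion([view_a, view_b]):
        print(fused_box)
```

In this sketch a box seen by only one of two views keeps its coordinates but has its score halved, which is one way a fusion step can reward agreement between views while still passing through detections that an occluded view missed.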

If this is right

  • Three-view fusion adds further accuracy at only modest extra communication cost.
  • On-device WBF removes the need for a host computer and raises the fraction of frames that contain detections.
  • The same fusion step works across both USB-relay and Wi-Fi peer-to-peer setups.
  • Federated learning remains possible but shows limited gains when data across nodes are non-iid.
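
For context on the federated-learning point, the aggregation step of FedAvg (McMahan et al.) can be sketched as follows. This is a generic illustration with names of our choosing, not the paper's decentralised setup: each node contributes its parameters, weighted by how many local samples it trained on.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def fed_avg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Sample-count-weighted average of per-client parameters."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged: Weights = {}
    for name, reference in client_weights[0].items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return averaged


if __name__ == "__main__":
    node_a = {"w": [1.0, 2.0]}  # trained on 1 sample
    node_b = {"w": [3.0, 6.0]}  # trained on 3 samples
    print(fed_avg([node_a, node_b], [1, 3]))  # {'w': [2.5, 5.0]}
```

When nodes see very different data (the non-iid case), their local parameters drift apart and this plain average can land far from any node's good solution, which is consistent with the limited gains noted above.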

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar late-fusion logic could apply to other bandwidth-limited multi-sensor tasks such as distributed tracking.
  • Energy measurements on longer missions would show whether the coverage gain translates into extended battery runtime.
  • Replacing the current backbone with newer tiny detectors might lower the baseline memory requirement even further.

Load-bearing premise

The specific occlusion patterns and datasets used produce accuracy and energy gains that represent real search-and-rescue conditions on these boards.

What would settle it

A field deployment in actual search-and-rescue terrain where the measured mAP improvement from two- or three-view WBF falls below 0.1 and coverage gain disappears.

Figures

Figures reproduced from arXiv: 2606.02894 by Chieh-Tung Cheng, Eiman Kanjo, Mustafa Aslanov.

Figure 1: Symmetric (left) and asymmetric (right) quantisation representation […]
Figure 2: Illustration of common network topologies used in DFL
Figure 3: Overview of the system architecture: (Left) model pre-training with MCUNet backbone and YOLOv2 head; (Middle) collabora…
Figure 4: Backbone structure of MCUNet
  Text excerpt captured with this caption: "… map produced by the second convolutional layer in the detection head. The combined representation compensates for the coarse resolution of the final layer, which is effective for large objects but insufficient for detecting small ones. As illustrated in …"
Figure 5: YOLOv2 detection head used in this work
Figure 6: Multi-view feature-level fusion pipeline
Figure 7: Multi-view decision-level fusion pipeline
Figure 8: Loss curve for fine-tuning MCUNet-YOLOv2 with …
Figure 9: Trade-off between accuracy and FLOPs under varying input resolutions.
Figure 10: Backbone comparison: MCUNet, MobileNetV2, and ResNet-18.
Figure 11: Feature-level vs. decision-level fusion accuracy across occlusion pairs.
Figure 12: WBF (red) and baseline (blue) precision–recall comparison (view 1).
Figure 13: WBF (red) and baseline (blue) precision–recall comparison (view 2).
Figure 14: Example of confidence averaging in WBF.
  Text excerpt captured with this caption: "… achieving improvements of +0.1538, +0.2124, and +0.1815 mAP across the three views. These results demonstrate that three-view collaborative inference remains effective and robust even under severe occlusion. In …"
Figure 15: Two-view versus three-view fusion accuracy across occlusion triplets.
Figure 16: Accuracy versus communication cost for two-view and three-view fusion.
Figure 17: Training loss curve of FedAvg over successive rounds.
Original abstract

Edge AI nodes for search and rescue are increasingly expected to run computer vision locally, yet ultra-low-end hardware imposes hard constraints on memory, compute, and inter-device communication. This work addresses occlusion-robust object detection on devices with less than 1 MB SRAM by combining an MCUNet backbone, a YOLOv2 detection head, and Lite quantisation. Two collaborative inference strategies are evaluated: feature-level fusion, concatenating intermediate feature maps, and decision-level fusion via Weighted Boxes Fusion (WBF). WBF outperforms feature-level fusion under all tested occlusion conditions, yielding gains of up to +0.2736 mAP in asymmetric scenarios. Extending fusion to three views improves accuracy further (up to +0.3827 mAP) at modest communication overhead (~1.3 KB per exchange). Hardware experiments progress from a host-assisted USB-relay baseline to a Wi-Fi peer-to-peer deployment on two Coral Dev Board Micro units, where WBF executes on-device with negligible communication energy relative to inference. In a 301.9 s autonomous session of 108 frames, fused output is produced on 61 frames versus 47 for a single board - a coverage gain of +29.8%. A decentralised federated learning feasibility note is included but not treated as a primary result, as performance remains limited under non-iid data. The results support decision-level fusion as a viable option for improving occlusion robustness in small-scale edge object detection, including host-free multi-board operation on ultra-low-end hardware.
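
The coverage figures in the abstract can be checked with a few lines of arithmetic. The helper below is our own illustration (the function name is hypothetical); it shows that the +29.8% is a relative gain over the single-board frame count, not a difference in percentage points.

```python
def coverage_stats(single_frames: int, fused_frames: int, total_frames: int):
    """Return (single coverage, fused coverage, relative gain) as fractions."""
    single_rate = single_frames / total_frames
    fused_rate = fused_frames / total_frames
    relative_gain = (fused_frames - single_frames) / single_frames
    return single_rate, fused_rate, relative_gain


if __name__ == "__main__":
    # Values quoted in the abstract: 47 vs. 61 frames out of 108.
    single, fused, gain = coverage_stats(47, 61, 108)
    print(f"single-board coverage: {single:.1%}")                    # 43.5%
    print(f"fused coverage:        {fused:.1%}")                     # 56.5%
    print(f"relative gain:         {gain:+.1%}")                     # +29.8%
    print(f"absolute difference:   {(fused - single) * 100:.1f} pp")  # 13.0 pp
```

Read this way, fused output covers a little over half of the session's frames, up from a little under half for one board.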

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper evaluates two collaborative inference strategies—feature-level fusion and decision-level fusion via Weighted Boxes Fusion (WBF)—for occlusion-robust object detection on ultra-low-end edge devices (<1 MB SRAM) using an MCUNet backbone with YOLOv2 head and Lite quantization. It reports that WBF outperforms feature-level fusion under tested occlusion conditions (gains up to +0.2736 mAP in asymmetric cases), with further gains from three-view fusion (+0.3827 mAP) at ~1.3 KB overhead per exchange. Hardware experiments on Coral Dev Board Micro units progress from USB-relay to Wi-Fi P2P, claiming on-device WBF execution with negligible communication energy and a +29.8% coverage gain (61 vs. 47 frames) in a 301.9 s / 108-frame autonomous session. A brief federated learning note is included but not central.

Significance. If the hardware claims hold, the work supplies concrete empirical evidence that decision-level fusion can improve occlusion robustness and coverage in multi-view setups on severely memory-constrained devices, with modest communication cost. The reported mAP deltas and coverage numbers from real hardware runs are a strength; the approach could be relevant for search-and-rescue edge AI if energy and dataset details are clarified.

major comments (1)
  1. [Abstract / hardware experiments] Abstract and hardware experiments section: the central claim that WBF on-device execution incurs 'negligible communication energy relative to inference' in the Wi-Fi P2P Coral Dev Board Micro deployment is unsupported by any quantitative measurements (e.g., mJ per inference vs. per 1.3 KB exchange, or power traces). This directly weakens the hardware-practicality half of the contribution for <1 MB SRAM devices.
minor comments (1)
  1. [Abstract / results] The abstract and results mention specific datasets and occlusion generation methods but give insufficient detail on the exact occlusion synthesis procedure and report no statistical significance tests for the mAP deltas.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the hardware energy claim. We address it directly below.

Point-by-point responses
  1. Referee: [Abstract / hardware experiments] Abstract and hardware experiments section: the central claim that WBF on-device execution incurs 'negligible communication energy relative to inference' in the Wi-Fi P2P Coral Dev Board Micro deployment is unsupported by any quantitative measurements (e.g., mJ per inference vs. per 1.3 KB exchange, or power traces). This directly weakens the hardware-practicality half of the contribution for <1 MB SRAM devices.

    Authors: We agree that the manuscript provides no direct quantitative energy measurements (mJ, power traces, or per-inference vs. per-exchange comparisons) to substantiate the 'negligible communication energy' phrasing. The statement was based on the modest payload size (~1.3 KB) and the known high compute cost of MCUNet inference on the target platform, but this remains a qualitative inference rather than an empirically measured result. In the revised version we will (1) remove or qualify the unqualified 'negligible' wording in both the abstract and hardware-experiments section, (2) explicitly note the absence of direct energy profiling, and (3) add a short discussion of the data-size argument together with a reference to typical Wi-Fi energy costs on similar Cortex-M devices. These changes will make the hardware-practicality claims more precise while preserving the reported coverage and latency numbers. revision: yes
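
The comparison the referee asks for reduces to a simple ratio once the measurements exist. The sketch below is a hypothetical helper, not something taken from the paper: it computes the share of per-frame energy spent on the exchange from three inputs that would have to be measured on the boards. The numbers in the usage lines are placeholders chosen only to exercise the function; the paper reports no such values.

```python
def comm_energy_fraction(
    payload_bytes: float,
    tx_joules_per_byte: float,
    inference_joules: float,
) -> float:
    """Fraction of per-frame energy (inference + exchange) spent on the exchange."""
    if payload_bytes < 0 or tx_joules_per_byte < 0 or inference_joules <= 0:
        raise ValueError("inputs must be non-negative, with positive inference energy")
    comm_joules = payload_bytes * tx_joules_per_byte
    return comm_joules / (comm_joules + inference_joules)


if __name__ == "__main__":
    # PLACEHOLDER inputs, not measurements: a payload of about 1.3 KB,
    # an assumed radio cost per byte, and an assumed energy per inference.
    fraction = comm_energy_fraction(
        payload_bytes=1300,
        tx_joules_per_byte=1e-6,
        inference_joules=0.05,
    )
    print(f"communication share of per-frame energy: {fraction:.2%}")
```

A claim of "negligible" would then correspond to this fraction staying small across the measured range of radio conditions, which is what power traces would establish.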

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions

Full rationale

The paper evaluates two fusion strategies (feature-level vs. WBF decision-level) via direct mAP measurements on occlusion datasets and hardware timing/energy on Coral Dev Board Micro units. No equations, fitted parameters, or derivation chains are present; results are reported as measured outputs (e.g., +0.2736 mAP, 1.3 KB overhead, 301.9 s session coverage). The reader's assessment of score 1.0 is consistent with the absence of any load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an empirical comparison of standard techniques on constrained hardware.



Reference graph

Works this paper leans on

38 extracted references · 18 canonical work pages

  1. Chui M, Collins M, and Patel M. The Internet of Things: Catching Up to an Accelerating Opportunity. 2021. Accessed: 2025-08-24. Available from: https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/iot-value-set-to-accelerate-through-2030-where-and-how-to-capture-it
  2. Chen J and Ran X. Deep learning with edge computing: A review. Proceedings of the IEEE 2019; 107:1655–74. doi: 10.1109/JPROC.2019.2921977
  3. Wang F, Zhang M, Wang X, Ma X, and Liu J. Deep learning for edge computing applications: A state-of-the-art survey. IEEE Access 2020; 8:58322–36. doi: 10.1109/ACCESS.2020.2982411
  4. Goel A, Tung C, Lu YH, and Thiruvathukal GK. A survey of methods for low-power deep learning and computer vision. In: 2020 IEEE 6th World Forum on Internet of Things (WF-IoT). IEEE; 2020:1–6
  5. Huang Z, Yang S, Zhou M, Gong Z, Abusorrah A, Lin C, and Huang Z. Making accurate object detection at the edge: review and new approach. Artificial Intelligence Review 2022; 55:2245–74. doi: 10.1007/s10462-021-10059-3
  6. Bany Abdelnabi AA and Rabadi G. Human detection from unmanned aerial vehicles’ images for search and rescue missions: A state-of-the-art review. IEEE Access 2024; 12:152009–35. doi: 10.1109/ACCESS.2024.3479988
  7. Dong J, Ota K, and Dong M. UAV-based real-time survivor detection system in post-disaster search and rescue operations. IEEE Journal on Miniaturization for Air and Space Systems 2021; 2:209–19. doi: 10.1109/JMASS.2021.3083659
  8. Ruan J, Cui H, Huang Y, Li T, Wu C, and Zhang K. A review of occluded objects detection in real complex scenarios for autonomous driving. Green Energy and Intelligent Transportation 2023; 2:100092. doi: 10.1016/j.geits.2023.100092
  9. Liu HI, Galindo M, Xie H, Wong LK, Shuai HH, Li YH, and Cheng WH. Lightweight deep learning for resource-constrained environments: A survey. ACM Computing Surveys 2024; 56(10):Article 267. doi: 10.1145/3657282
  10. Lin J, Chen WM, Lin Y, Cohn J, Gan C, and Han S. MCUNet: Tiny Deep Learning on IoT Devices. arXiv 2020. arXiv:2007.10319 [cs.CV]. Available from: https://arxiv.org/abs/2007.10319
  11. Lin J, Chen WM, Cai H, Gan C, and Han S. MCUNetV2: Memory-Efficient Patch-Based Inference for Tiny Deep Learning. arXiv 2021. arXiv:2110.15352 [cs.CV]. Available from: https://arxiv.org/abs/2110.15352
  12. Zhu H, Tang P, Park J, Park S, and Yuille A. Robustness of object recognition under extreme occlusion in humans and computational models. arXiv
  13. arXiv:1905.04598 [cs.CV]. Available from: https://arxiv.org/abs/1905.04598
  14. Saleh K, Szénási S, and Vámossy Z. Occlusion handling in generic object detection: A review. In: 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI). IEEE; 2021:477–84
  15. Kortylewski A, He J, Liu Q, and Yuille AL. Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:8940–49
  16. Wang A, Sun Y, Kortylewski A, and Yuille AL. Robust object detection under occlusion with context-aware CompositionalNets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020:12645–54. Available from: https://openaccess.thecvf.com/content_CVPR_2020/html/Wang_Robust_Object_Detection_Under_Occlusion_With_Conte...
  17. Zhang Z, Xie C, Wang J, Xie L, and Yuille AL. DeepVoting: A robust and explainable deep network for semantic part detection under partial occlusion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018:1372–80
  18. Balamurugan D, Aravinth SS, Reddy PCS, Rupani A, and Manikandan A. Multiview objects recognition using deep learning-based Wrap-CNN with voting scheme. Neural Processing Letters 2022; 54(3):1495–521. doi: 10.1007/s11063-021-10679-4
  19. Palena M, Cerquitelli T, and Chiasserini CF. Edge-device collaborative computing for multi-view classification. Computer Networks 2024; 254:110823. doi: 10.1016/j.comnet.2024.110823
  20. Baltrušaitis T, Ahuja C, and Morency LP. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019; 41(2):423–43. doi: 10.1109/TPAMI.2018.2798607
  21. Atrey PK, Hossain MA, El Saddik A, and Kankanhalli MS. Multimodal fusion for multimedia analysis: A survey. Multimedia Systems 2010; 16:345–79
  22. Boulahia SY, Amamra A, Madi MR, and Daikh S. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 2021; 32(6):121. doi: 10.1007/s00138-021-01249-8
  23. Tang C, Ling Y, Yang X, Jin W, and Zheng C. Multi-view object detection based on deep learning. Applied Sciences 2018; 8(9):1423. doi: 10.3390/app8091423
  24. Su S, Li B, Zhang C, Yang M, and Xue X. Cross-Domain Federated Object Detection. In: 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE; 2023:1469–74. doi: 10.1109/ICME55011.2023.00254. Available from: http://dx.doi.org/10.1109/ICME55011.2023.00254
  25. Solovyev R, Wang W, and Gabruseva T. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing 2021; 107:104117. doi: 10.1016/j.imavis.2021.104117. Available from: http://dx.doi.org/10.1016/j.imavis.2021.104117
  26. Alzahrani M, Usman M, Jarraya SK, Anwar S, and Helmy T. Deep models for multi-view 3D object recognition: A review. Artificial Intelligence Review 2024; 57(12):Article 323. doi: 10.1007/s10462-024-10941-w
  27. Lalitha A, Shekhar S, Javidi T, and Koushanfar F. Fully decentralized federated learning. In: Third Workshop on Bayesian Deep Learning (NeurIPS). Vol. 12. 2018
  28. Yuan L, Wang Z, Sun L, Yu PS, and Brinton CG. Decentralized Federated Learning: A Survey and Perspective. IEEE Internet of Things Journal 2024; 11:34617–38. doi: 10.1109/JIOT.2024.3407584
  29. Boyd S, Ghosh A, Prabhakar B, and Shah D. Randomized gossip algorithms. IEEE Transactions on Information Theory 2006; 52:2508–30.
  30. Himeur Y, Varlamis I, Kheddar H, Amira A, Atalla S, Singh Y, Bensaali F, and Mansoor W. Federated learning for computer vision. arXiv 2023. arXiv:2308.13558 [cs.CV]. Available from: https://arxiv.org/abs/2308.13558
  31. Redmon J and Farhadi A. YOLO9000: Better, Faster, Stronger. arXiv 2016. arXiv:1612.08242 [cs.CV]. Available from: https://arxiv.org/abs/1612.08242
  32. Zhang Q and Kanjo E. A multicore and Edge TPU-accelerated multimodal TinyML system for livestock behavior recognition. IEEE Internet of Things Journal 2026; 13(1):666–77. doi: 10.1109/JIOT.2025.3624811. Available from: https://arxiv.org/abs/2504.11467
  33. Reizenstein J, Shapovalov R, Henzler P, Sbordone L, Labatut P, and Novotny D. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. arXiv 2021. arXiv:2109.00512 [cs.CV]. Available from: https://arxiv.org/abs/2109.00512
  34. DeVries T and Taylor GW. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017. arXiv:1708.04552 [cs.CV]. Available from: https://arxiv.org/abs/1708.04552
  35. McMahan B, Moore E, Ramage D, Hampson S, and Agüera y Arcas B. Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR; 2017:1273–82
  36. Gao SH, Han Q, Li D, Cheng MM, and Peng P. Representative Batch Normalization with Feature Calibration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021:8669–79
  37. A Standard for the Transmission of IP Datagrams over Ethernet Networks. RFC 894. 1984 Apr. doi: 10.17487/RFC0894. Available from: https://www.rfc-editor.org/info/rfc894
  38. Coral. Dev Board Micro datasheet. Version 1.0. Google LLC. Available from: https://coral.ai/static/files/Coral-Dev-Board-Micro-datasheet.pdf

Manuscript submitted to ACM.