pith. sign in

arxiv: 2407.15199 · v3 · pith:LABWPN5Qnew · submitted 2024-07-21 · 💻 cs.CV · cs.CY

Multiple Object Detection and Tracking in Panoramic Videos for Cycling Safety Analysis

Pith reviewed 2026-05-23 23:02 UTC · model grok-4.3

classification 💻 cs.CV cs.CY
keywords panoramic videoobject detectionmulti-object trackingcycling safetyovertaking detection360 degree imagerycomputer visionvehicle tracking
0
0 comments X

The pith

A framework segments 360° images into sub-images and modifies trackers for boundary continuity to improve detection and tracking in panoramic cycling videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that panoramic videos can overcome distortions, small objects, and boundary issues that defeat standard models by first projecting sub-images for detection then adjusting tracking to preserve continuity and use category cues. A sympathetic reader would care because crash records lack the spatial and temporal detail needed to spot everyday risk factors for cyclists, and bike-mounted 360° cameras could supply that missing data at scale. The method is tested on London roadway videos and reports gains in average precision, a 10% drop in identification switches, 2.7% better identification precision, and 0.82 F-score on overtaking events.

Core claim

The paper claims that its three-step framework—segmenting and projecting 360° images into sub-images to enhance detection accuracy, modifying multi-object tracking models to incorporate boundary continuity and object category information, and validating the pipeline on vehicle overtaking detection—produces higher average precision across resolutions, a 10.0% decrease in identification switches, a 2.7% improvement in identification precision, and an F-score of 0.82 when applied to real-world panoramic videos recorded by cyclists on London's roadways.

What carries the argument

The three-step framework of sub-image segmentation and projection for detection, boundary-continuous multi-object tracking that adds category information, and overtaking detection validation.

If this is right

  • Higher average precision for object detection on panoramic imagery at varying resolutions.
  • 10.0% reduction in identification switches and 2.7% gain in identification precision for multi-object tracking.
  • Vehicle overtaking detection reaches an F-score of 0.82 in diverse real-world cycling conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same segmentation-plus-boundary-tracking steps could be tested on panoramic footage from other moving platforms such as cars or drones.
  • Pairing the tracked objects with route GPS data would allow mapping of high-risk overtaking locations along specific streets.
  • The framework might be extended to detect additional events such as close passes or dooring by cyclists.

Load-bearing premise

That segmenting and projecting the original 360° images into sub-images will improve detection accuracy without introducing new errors from projection artifacts or loss of context at sub-image boundaries.

What would settle it

Applying standard detection and tracking models directly to the same set of London panoramic videos and measuring equal or higher average precision together with equal or fewer identification switches than the proposed framework would falsify the central claim.

Figures

Figures reproduced from arXiv: 2407.15199 by Ilya Ilyankou, James Haworth, Jingwei Guo, Meihui Wang, Natchapon Jongwiriyanurak, Nicola Christie, Xiaowei Gao, Yitai Cheng.

Figure 1
Figure 1. Figure 1: An example of the images before (left) and after (right) turning on the ‘Horizon [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: How objects across the boundaries were annotated in CVAT [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Workflow of merging the split objects in MOT annotation files [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Position distribution of the ground truth bounding boxes in the dataset [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Category distribution of the ground truth [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Size distribution of the ground truth bounding [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Three steps of the methodology behaviour analysis of panoramic cycling videos, this project designed and implemented a 293 three-step methodology, as shown in [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Workflow of the proposed method for improving object detection on panoramic images [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Two cases of projection from a sphere to a plane: (a) towards the z-axis (b) towards [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Projections of an object in different sub-images: The closer it is to the edges of a [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Process of reprojecting a point from perspective to equirectangular [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Process of reprojecting a bounding box from perspective to equirectangular [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Process of reprojecting an object which is across the boundaries [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Method for processing the objects which are displayed in several sub-images [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Workflow of DeepSORT (Avola et al., 2021) and the proposed method for improving its performance on panoramic cycling videos 4.2 Improving Multiple Object Tracking on Panoramic Cycling 401 Videos 402 The second step of the proposed methodology is to improve the applicability of DeepSORT 403 (Wojke et al., 2017), the most commonly-used MOT model, to panoramic cycling videos. 404 As shown in [PITH_FULL_IMAG… view at source ↗
Figure 16
Figure 16. Figure 16: An example of wrong associations between tracks and detections [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Process of category filtering on a cost matrix [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Method for tracking the detections which are shown twice in one frame [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Preparations for the improved distance calculation process (suitable for Mahalanobis [PITH_FULL_IMAGE:figures/full_fig_p022_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Workflow of the application which can detect the overtaking behaviour of the sur [PITH_FULL_IMAGE:figures/full_fig_p023_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Flow chart of the overtaking behaviour detection for a specific object [PITH_FULL_IMAGE:figures/full_fig_p024_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Examples of how category information avoids incorrect data matching (Left: Original [PITH_FULL_IMAGE:figures/full_fig_p028_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Tracking results of the objects which are shown in two parts before and after the [PITH_FULL_IMAGE:figures/full_fig_p029_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Tracking results of the objects which leave from one side of the image and return [PITH_FULL_IMAGE:figures/full_fig_p029_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: The instance which gets different detection results in different applications [PITH_FULL_IMAGE:figures/full_fig_p031_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Example of a ghost bus, which in fact is a shop front that is classed as a bus in a [PITH_FULL_IMAGE:figures/full_fig_p032_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: The cyclist’s food delivery-style backpack, as well as rain drops, partly obstruct the [PITH_FULL_IMAGE:figures/full_fig_p033_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Example of unsuccessful detection of the proposed model on objects across the sub [PITH_FULL_IMAGE:figures/full_fig_p034_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Examples of ID switching when the angle of a vehicle toward the camera changes [PITH_FULL_IMAGE:figures/full_fig_p035_29.png] view at source ↗
read the original abstract

Cyclists face a disproportionate risk of injury, yet conventional crash records are too sparse to identify risk factors at fine spatial and temporal scales. Recently, naturalistic studies have used video data to capture the complex behavioural and infrastructural risk factors. A promising format is panoramic video, which can record 360$^\circ$ views around a rider. However, its use is limited by distortions, large numbers of small objects, and boundary continuity, which cannot be handled using existing computer vision models. This research proposes a novel three-step framework: (1) enhancing object detection accuracy on panoramic imagery by segmenting and projecting the original 360$^\circ$ images into sub-images; (2) modifying multi-object tracking models to incorporate boundary continuity and object category information; and (3) validating through a real-world application of vehicle overtaking detection. The methodology is evaluated using panoramic videos recorded by cyclists on London's roadways under diverse conditions. Experimental results demonstrate improvements over baselines, achieving higher average precision across varying image resolutions. Moreover, the enhanced tracking approach yields a 10.0% decrease in identification switches and a 2.7% improvement in identification precision. The overtaking detection task achieves a high F-score of 0.82, illustrating the practical effectiveness of the proposed method in real-world cycling safety scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a three-step framework for multiple object detection and tracking in 360° panoramic videos to support cycling safety analysis: (1) segmenting and projecting original images into sub-images to improve detection of small/distant objects, (2) adapting multi-object trackers to enforce boundary continuity and incorporate object category information, and (3) demonstrating the pipeline on a vehicle overtaking detection task. Evaluation uses real-world panoramic videos recorded by cyclists on London roadways; the abstract reports higher average precision across resolutions, a 10.0% reduction in identification switches, a 2.7% gain in identification precision, and an F-score of 0.82 on overtaking detection.

Significance. If the numeric gains are robust, the work would be significant for enabling finer-grained naturalistic studies of cyclist risk factors using panoramic video, a format that standard CV models handle poorly due to distortion and continuity issues. The real-world London dataset and the concrete application to overtaking detection are strengths that tie the technical contributions to a clear safety use case.

major comments (2)
  1. [Abstract] Abstract: The headline claims (10.0% decrease in identification switches, 2.7% improvement in identification precision, F-score 0.82) are presented without error bars, ablation results isolating the contribution of the projection step versus the tracker modifications, or any description of training data, model architectures, or baseline implementations. These omissions make the central performance assertions difficult to evaluate.
  2. [Abstract] Abstract (step 1 of framework): The claim that segmenting and projecting 360° images into sub-images improves detection accuracy rests on the untested assumption that this step does not introduce new errors from projection artifacts or loss of context at sub-image boundaries; no quantitative check of these effects is referenced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to improve the clarity of the reported results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims (10.0% decrease in identification switches, 2.7% improvement in identification precision, F-score 0.82) are presented without error bars, ablation results isolating the contribution of the projection step versus the tracker modifications, or any description of training data, model architectures, or baseline implementations. These omissions make the central performance assertions difficult to evaluate.

    Authors: The abstract provides a concise summary of the headline results. Full details on training data, model architectures (including YOLOv5 for detection and the modified ByteTrack), baseline implementations, and ablation studies that isolate the projection step from the tracker modifications appear in Sections 3 and 4 of the manuscript, along with the overtaking detection evaluation protocol. Error bars are omitted because the primary evaluation uses a single fixed real-world dataset; however, we can add standard deviations from k-fold splits on the dataset in a revision. To address the concern directly, we will revise the abstract to include a short reference to the evaluation setup and point readers to the relevant sections. revision: partial

  2. Referee: [Abstract] Abstract (step 1 of framework): The claim that segmenting and projecting 360° images into sub-images improves detection accuracy rests on the untested assumption that this step does not introduce new errors from projection artifacts or loss of context at sub-image boundaries; no quantitative check of these effects is referenced.

    Authors: We agree that an explicit quantitative check of projection artifacts and boundary context loss would strengthen the presentation. The manuscript demonstrates the net benefit of the segmentation/projection step via improved average precision across resolutions relative to direct panoramic detection (reported in the results). Overlapping sub-images are used to retain context, but we did not isolate artifact effects in a dedicated experiment. We will add this analysis, including boundary-specific precision metrics, in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical three-step pipeline for panoramic video analysis, with performance gains measured on real-world cyclist footage using standard detection/tracking metrics (AP, ID switches, precision, F-score). No equations, fitted parameters, or self-citations are presented that reduce any reported result to an input by construction; the evaluation remains externally falsifiable against held-out video data and baseline models. The framework choices (segmentation/projection, boundary continuity) are motivated by domain challenges rather than derived from the metrics themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard assumptions of modern computer vision pipelines (pre-trained detectors, IoU-based association) with no new free parameters, axioms, or invented entities introduced.

pith-pipeline@v0.9.0 · 5793 in / 1158 out tokens · 17526 ms · 2026-05-23T23:02:03.227697+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CycleTrajectory: An End-to-End Pipeline for Enriching and Analyzing GPS Trajectories to Understand Cycling Behavior and Environment

    cs.DB 2024-06 unverdicted novelty 4.0

    CycleTrajectory is a pipeline that filters, resamples, map-matches via OSRM, enriches with OSM road data, and derives cycling metrics from action-camera GPS trajectories, reporting 5.64% map-matching error.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 1 Pith paper

  1. [1]

    Therefore, many local governments and trans-38 port authorities are implementing policies to encourage cycling as an alternative to car39 travel

    Introduction 35 Cycling, as an active travel mode, has zero carbon emissions (Massink et al., 2011) and36 can support reductions in traffic congestion (Brunsing, 1997) while improving the health37 of urban residents (Wanner et al., 2012). Therefore, many local governments and trans-38 port authorities are implementing policies to encourage cycling as an a...

  2. [2]

    In combination, these actions and interactions within the environment 109 contribute to crash risk

    Related work 105 2.1 Challenges in Understanding Cycling Crash Risk 106 Urban transport networks are complex environments in which many agents (pedestrians,107 cyclists and other road users) interact with each other while using different types of 108 3 infrastructure. In combination, these actions and interactions within the environment 109 contribute to ...

  3. [3]

    154 Existing object detection and MOT algorithms are trained on video from cameras with155 a limited field of view (FOV)

    and to track objects using the views from multiple cameras (Jiménez-Bravo et al.,153 2022). 154 Existing object detection and MOT algorithms are trained on video from cameras with155 a limited field of view (FOV). This suits the data that has been collected in naturalistic156 cycling studies to date, which contain only front or rear facing views. However,...

  4. [4]

    Data 184 In order to test the improving effect of the proposed methods on object detection and185 MOT on panoramic cycling videos, this project built an annotated dataset so that the186 5 Video Length (s) FPS F rames Description Preview Video 1 14.0 30 420 Taken on a road Video 2 11.1 30 333 Taken on a bicycle path Video 3 12.0 30 360 Taken on a road Tabl...

  5. [5]

    Methodology 290 In order to apply computer vision291 models to automated road user 292 11 Figure 7: Three steps of the methodology behaviour analysis of panoramic cycling videos, this project designed and implemented a293 three-step methodology, as shown in Figure 7. 294 As the first step of the methodology, an approach was designed to improve the ap-295 ...

  6. [6]

    Results & Analysis 524 After implementing the proposed methodology, in this chapter, the project will present525 and analyse the results obtained in each step systematically. 526 5.1 Evaluation of the Model for Improving Object Detection on 527 Panoramic Cycling Images 528 The experiments designed in subsection 4.1.5 were implemented on a server equipped ...

  7. [7]

    In other words, it increases the646 upper limit of accuracy that an object detection model can achieve on a panoramic image

    Discussion 642 6.1 Improving Object Detection 643 Although Table 4 indicates that the performance of the proposed method can be also achieved644 by the original models with larger input resolution, once the input resolution is as large as that645 of the original image, the effect can not be replaced anymore. In other words, it increases the646 upper limit...

  8. [8]

    To enable analysis of the complete environment sur-707 35 rounding the rider, this research develops methods for object detection and MOT in panoramic708 equirectangular videos

    Conclusion 704 In the field of cycling safety, most of the existing studies applied computer vision models to ‘tra-705 ditional’ videos with a limited field of view for automated road user behaviour analysis (Ibrahim,706 Haworth, Christie, Cheng & Hailes, 2021). To enable analysis of the complete environment sur-707 35 rounding the rider, this research de...

  9. [9]

    (1997).Public transport and cycling: Experience of modal integration in ger-747 many

    https://doi.org/10.1109/ICIP.2016.7533003 746 36 Brunsing, J. (1997).Public transport and cycling: Experience of modal integration in ger-747 many. from the greening of urban transport(2nd ed.). Wiley (John); Sons, Limited.748 cheind. (2019). Py-motmetrics [Accessed 23 August 2022]. https://github.com/cheind/py-749 motmetrics 750 Daniels, S., Nuyts, E., &...

  10. [10]

    M., & Delmelle, E

    https://doi.org/10.1016/j.aap.2007.07.016 753 Delmelle, E. M., & Delmelle, E. C. (2012). Exploring spatio-temporal commuting patterns754 in a university environment.Transport Policy, 21, 1–9. https://doi.org/10.1016/j.755 tranpol.2011.12.007 756 Deng, F., Zhu, X., & Ren, J. (2017). Object detection on panoramic images based on deep757 learning. 2017 3rd I...

  11. [11]

    796 LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning.Nature, 521(7553), 436–444.797 https://doi.org/10.1038/nature14539 798 Lee, G.-W., & Han, J.-K. (2021). Viewport rendering algorithm with a curved surface for799 a wide fov in 360° images. Applied Sciences, 11(3), 1133. https://doi.org/10.3390/800 app11031133 801 Li, J., Liang, X., Wei, Y., Xu,...

  12. [12]

    888 Zuraimi, M. A. B., & Zaman, F. H. K. (2021). Vehicle detection and tracking using yolo889 and deepsort. 2021 IEEE 11th IEEE Symposium on Computer Applications & 890 Industrial Electronics (ISCAIE), 23–29. 891 40