Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

Binay Kumar Singh; Niels da Vitoria Lobo

arxiv: 2606.12628 · v1 · pith:66OMH7PHnew · submitted 2026-06-10 · 💻 cs.CV


Pith reviewed 2026-06-27 09:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords object detection, context fusion, autonomous driving, attention mechanisms, co-occurring objects, Cityscapes, BDD100K, feature fusion


The pith

Fusing local RoI attention with global top-K object co-occurrence attention improves detection of small and co-occurring objects in driving scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Context-Centric Feature Fusion framework that combines two attention modules to embed relational context into object detections. One module applies RoI-to-RoI self-attention to capture spatial interactions among nearby objects, especially small or occluded ones. The second pools the top-K region features into a single global token that encodes typical co-occurrence patterns without full pixel-level computation. When these local and object-centric signals are merged, the resulting embeddings raise category-level consistency scores and small-object average precision on standard autonomous-driving benchmarks.

Core claim

The fusion of local and object-centric global features yields contextualized embeddings that improve classification and the detection of co-occurring objects, achieving a CCS of 0.973 on Cityscapes and 0.969 on BDD100K, together with an AP_S of 14.1%.

What carries the argument

Context-Centric Feature Fusion (CCFF) that merges a Local Context Fusion Module using RoI-to-RoI self-attention with a Global Context Attention Module that pools top-K RoI features into an attention token.
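
To make the two-branch idea concrete, here is a minimal NumPy sketch of the fusion pattern described above. It is not the authors' implementation: the function names (`local_context`, `global_token`, `fuse`), the use of a detection score to pick the top-K RoIs, mean pooling for the global token, and concatenation as the fusion step are all assumptions made for illustration, and the attention has no learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_context(roi_feats):
    # Single-head RoI-to-RoI self-attention without learned projections
    # (assumption: the paper's LCFM will differ in its details).
    d = roi_feats.shape[1]
    scores = roi_feats @ roi_feats.T / np.sqrt(d)  # (N, N) pairwise affinities
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ roi_feats                     # (N, d)

def global_token(roi_feats, roi_scores, k):
    # Mean-pool the features of the K highest-scoring RoIs into one token
    # (assumption: selection by score and mean pooling are illustrative).
    k = min(k, roi_feats.shape[0])
    top = np.argsort(roi_scores)[::-1][:k]
    return roi_feats[top].mean(axis=0)             # (d,)

def fuse(roi_feats, roi_scores, k=3):
    # Concatenate each RoI's own feature, its local context, and the
    # shared global token (assumption: fusion by concatenation).
    local = local_context(roi_feats)
    token = np.broadcast_to(global_token(roi_feats, roi_scores, k), roi_feats.shape)
    return np.concatenate([roi_feats, local, token], axis=1)  # (N, 3d)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))                    # 5 RoIs, 8-dim features
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
fused = fuse(feats, scores, k=3)
print(fused.shape)  # (5, 24)
```

The global branch costs O(K·d) per image rather than a pass over every pixel, which is the kind of saving the paper attributes to avoiding pixel-level global pooling.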

If this is right

Small-object average precision reaches 14.1 percent while recovering rare classes such as Train.
Relational consistency measured by CCS exceeds 0.97 on both evaluated datasets.
The added modules impose only a 0.2 FPS overhead, preserving real-time operation.
The method avoids full pixel-level global pooling, limiting computational cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local-plus-global fusion pattern could be tested on indoor scene datasets where object relations also matter.
If the global token primarily encodes dataset priors, performance would degrade on scenes with atypical object pairings.
Replacing the top-K selection with learned selection might further reduce reliance on fixed co-occurrence counts.

Load-bearing premise

The top-K RoI pooling and RoI-to-RoI self-attention capture genuine relational context between objects rather than dataset-specific co-occurrence statistics.

What would settle it

Measure the Category-level Consistency Strategy (CCS) score on a new driving dataset whose object co-occurrence statistics differ markedly from those of Cityscapes and BDD100K; a substantial drop in CCS would falsify the claim.
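
One way to check that a candidate dataset really does have different co-occurrence statistics is to compare pairwise class co-occurrence frequencies directly. The sketch below is an illustration only: the helper names, the toy label sets, and the choice of mean absolute gap as the distance are assumptions, and it does not compute CCS, which the abstract does not define.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_freq(images):
    # Fraction of images in which each unordered class pair appears together.
    counts = Counter()
    for labels in images:
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    n = len(images)
    return {pair: c / n for pair, c in counts.items()}

def mean_abs_gap(freq_a, freq_b):
    # Average absolute difference in pair frequency over all observed pairs.
    pairs = set(freq_a) | set(freq_b)
    if not pairs:
        return 0.0
    return sum(abs(freq_a.get(p, 0.0) - freq_b.get(p, 0.0)) for p in pairs) / len(pairs)

# Toy per-image label sets (hypothetical, for illustration only).
train_like = [["car", "person"], ["car", "person", "bicycle"],
              ["car", "truck"], ["car", "person"]]
shifted = [["train", "person"], ["bicycle", "truck"],
           ["car", "train"], ["person", "truck"]]

gap = mean_abs_gap(cooccurrence_freq(train_like), cooccurrence_freq(shifted))
print(gap)  # 0.3125
```

A gap near zero would indicate that the new dataset repeats the training co-occurrence patterns and is therefore a weak test of the premise.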

Figures

Figures reproduced from arXiv: 2606.12628 by Binay Kumar Singh, Niels da Vitoria Lobo.

**Figure 1.** Detailed schematic flow of the proposed Context-Centric Feature Fusion (CCFF) architecture. Region proposals from the FPN are enhanced using parallel dual-stream contextual reasoning channels: (1) the Local Context Fusion Module (LCFM) handles localized spatial object interactions using regional self-attention; (2) the Global Context Attention Module (GCAM) maps global environmental priors by exec…

**Figure 2.** Qualitative visualization of semantic co-occurrence links during inference on Cityscapes. The left panel displays the raw input scene. The right panel illustrates finalized model predictions with our explicit relational logic links. Line colors distinguish discrete category configurations (e.g., red for person ↔ car, green for person ↔ bicycle). Line widths reflect attention confidence, demonstrating how t…

**Figure 3.** Qualitative results illustrating scale robustness in heterogeneous urban driving environments. The left panel represents the raw street environment. The right panel displays the corresponding CCFF inference result. Our model simultaneously maps the co-occurrences by utilizing a highly confident macro landmark (e.g., the 99% confidence train extraction) as a structural anchor. The co-occurrence links (e.g.,…

Original abstract

Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex heterogeneous environments rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which utilizes two attention-based modules, Local Context Fusion Module (LCFM) uses the RoI-to-RoI self-attention mechanism to resolve spatial interactions, mainly considering small and partially obscured objects, while Global Context Attention Module (GCAM) converts the co-occurrence of objects priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that enhance classification results and co-occurring objects detection. Our method is evaluated on two datasets, Cityscapes and BDD100K which demonstrate significant improvement on relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at https://github.com/BinayKSingh/CCFF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CCFF packages standard RoI attention and top-K global pooling for driving scenes and reports gains on small/rare objects, but the abstract leaves the source of those gains unclear.


The paper's main contribution is the CCFF pipeline that pairs LCFM (RoI-to-RoI self-attention aimed at small and occluded objects) with GCAM (top-K RoI pooling turned into a global attention token). This is evaluated on Cityscapes and BDD100K, where it reaches CCS scores of 0.973 and 0.969, lifts AP_S to 14.1 percent, and recovers some rare classes while adding only 0.2 FPS overhead. The code release is useful for anyone who wants to reproduce or extend the modules.

The work is straightforward and stays within the practical constraints of autonomous-driving perception. The choice to avoid full pixel-level global pooling makes sense for speed, and the focus on co-occurring objects matches a real pain point in crowded road scenes.

The soft spots are in the evidence. The abstract supplies headline numbers but no baseline tables, ablation breakdowns, error bars, or statistical tests, so it is impossible to tell how much the fusion itself drives the improvements versus dataset-specific tuning. The stress-test concern holds weight here: both evaluation sets are urban driving scenes with strong, repeated co-occurrence patterns, and the top-K selection plus pairwise attention directly ingest those patterns. Nothing described distinguishes genuine relational reasoning from memorization of the training distribution. Without tests on shifted layouts or explicit controls, the safety-relevant claim stays provisional.

This is for researchers already working on attention-based detectors for driving or similar constrained domains. A reader looking for a ready-to-try fusion block would find value in the implementation and timing numbers. It deserves peer review because the architecture is clearly motivated and the efficiency result is concrete, even though the current write-up would need more controls and comparisons before acceptance.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Context-Centric Feature Fusion (CCFF) framework for object detection in autonomous driving. It introduces the Local Context Fusion Module (LCFM) employing RoI-to-RoI self-attention for spatial interactions among small or obscured objects and the Global Context Attention Module (GCAM) that pools top-K RoI features into a global attention token to encode object co-occurrence priors. The fusion of these local and object-centric global features is claimed to yield contextualized embeddings that improve classification and co-occurring object detection. Evaluations on Cityscapes and BDD100K report Category-level Consistency Strategy (CCS) scores of 0.973 and 0.969 respectively, an AP_S of 14.1% for small objects, recovery of rare classes such as "Train", and real-time inference with a 0.2 FPS overhead. Code is released at the cited GitHub repository.

Significance. If the claimed gains are shown to arise from relational context rather than dataset co-occurrence statistics, the approach could improve handling of rare classes and small objects in heterogeneous driving scenes while maintaining efficiency. The public code release supports reproducibility and is noted as a strength.

major comments (2)

[Abstract] The reported CCS scores (0.973/0.969) and AP_S (14.1%) are presented without any baseline comparisons, ablation results for LCFM/GCAM, error bars, or statistical tests. This prevents verification that the improvements are attributable to the proposed context fusion rather than post-hoc tuning or dataset characteristics.
[Abstract, method description] GCAM is described as converting "co-occurrence of objects priors" via top-K RoI pooling, and LCFM uses RoI-to-RoI self-attention; however, no mechanism is shown that distinguishes encoding of true spatial/relational structure from memorization of the highly correlated object layouts present in both Cityscapes and BDD100K. No tests on scenes with altered co-occurrence distributions are reported.
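
As an illustration of the kind of error bar the first comment asks for, the sketch below computes a paired percentile-bootstrap confidence interval for the mean per-image score difference between two detectors. The function name and the toy score lists are hypothetical. Because AP is a dataset-level metric, a real analysis would recompute AP on each resampled image set; per-image scores are used here only as a stand-in.

```python
import random

def paired_bootstrap_ci(baseline, proposed, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile CI for the mean per-image difference (proposed - baseline).
    if len(baseline) != len(proposed) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [p - b for b, p in zip(baseline, proposed)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample images with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-image scores for the same images under two models.
base_scores = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.45]
ccff_scores = [0.47, 0.58, 0.30, 0.66, 0.50, 0.57, 0.44, 0.46]
lower, upper = paired_bootstrap_ci(base_scores, ccff_scores)
print(lower, upper)
```

An interval that excludes zero would support attributing the gain to the model change rather than to sampling noise, although it would not by itself separate relational reasoning from learned dataset priors.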

minor comments (2)

[Abstract] The efficiency claim of "0.2 FPS overhead" lacks specification of the base detector, hardware, or input resolution used for the timing measurement.
Notation for the modules (LCFM, GCAM) and the Category-level Consistency Strategy (CCS) metric would benefit from explicit definitions or references in the main text.
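
The first minor comment matters because the same FPS drop implies very different added latency depending on the base frame rate. The short calculation below illustrates this, assuming that "0.2 FPS overhead" means throughput falls by 0.2 frames per second relative to the base detector; the abstract does not state this, and the base frame rates used are hypothetical.

```python
def added_latency_ms(base_fps, fps_drop):
    # Extra per-frame time, in milliseconds, implied by a drop in throughput.
    if not 0 < fps_drop < base_fps:
        raise ValueError("fps_drop must be positive and smaller than base_fps")
    return 1000.0 * (1.0 / (base_fps - fps_drop) - 1.0 / base_fps)

# Hypothetical base frame rates; the paper's actual baseline is not given here.
for base in (30.0, 5.0):
    print(base, round(added_latency_ms(base, 0.2), 3))
# 30.0 0.224
# 5.0 8.333
```

At 30 FPS the drop corresponds to roughly 0.22 ms per frame, while at 5 FPS it corresponds to roughly 8.3 ms, so the claim cannot be interpreted without the base detector, hardware, and input resolution.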

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to enhance clarity and provide supporting evidence where feasible.


Referee: [Abstract] The reported CCS scores (0.973/0.969) and AP_S (14.1%) are presented without any baseline comparisons, ablation results for LCFM/GCAM, error bars, or statistical tests. This prevents verification that the improvements are attributable to the proposed context fusion rather than post-hoc tuning or dataset characteristics.

Authors: We agree that the abstract, as a high-level summary, omits these supporting details. The full manuscript includes baseline comparisons and ablations for LCFM/GCAM in the experiments section. In revision we will update the abstract to explicitly reference these results, add error bars, and note statistical tests to better attribute gains to the context fusion modules. revision: yes

Referee: [Abstract, method description] GCAM is described as converting "co-occurrence of objects priors" via top-K RoI pooling, and LCFM uses RoI-to-RoI self-attention; however, no mechanism is shown that distinguishes encoding of true spatial/relational structure from memorization of the highly correlated object layouts present in both Cityscapes and BDD100K. No tests on scenes with altered co-occurrence distributions are reported.

Authors: LCFM's RoI-to-RoI self-attention explicitly computes pairwise spatial relations among detected objects in a scene, which is independent of dataset-wide co-occurrence frequencies. GCAM's top-K pooling similarly operates on per-image RoI features to form dynamic global tokens. While we did not evaluate on synthetically altered co-occurrence distributions, the observed gains on rare classes (e.g., "Train") and small objects are consistent with relational modeling rather than pure memorization. We will add a clarifying discussion of this distinction in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architecture evaluated on public benchmarks


The paper introduces LCFM (RoI-to-RoI self-attention) and GCAM (top-K RoI pooling) as architectural modules whose outputs are fused for contextual embeddings. All reported gains (CCS 0.973/0.969, AP_S 14.1%) are direct empirical measurements on Cityscapes and BDD100K. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claims remain externally falsifiable on the stated datasets without reducing to internal definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method relies on standard attention operations and public datasets.

