LFA: Layer Feature Attention for Run-Time Introspection of 2D Object Detectors in Automated Driving

Alois Knoll; Mert Keser

arxiv: 2606.00372 · v1 · pith:6XIH6PHPnew · submitted 2026-05-29 · 💻 cs.CV

Pith reviewed 2026-06-28 22:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords introspection · object detection · automated driving · attention mechanism · multi-layer features · failure prediction · KITTI · BDD100K

The pith

Layer Feature Attention aggregates multi-layer backbone features via attention to predict object detector failures more accurately than single-layer methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Layer Feature Attention (LFA) to predict when 2D object detectors fail in automated driving scenarios. It shows that existing introspection approaches discard useful signals by relying only on the final layer or hand-crafted statistics. LFA instead learns end-to-end attention weights to combine features from multiple backbone layers, exploiting the fact that low-level layers reveal fine details relevant to small or occluded objects while high-level layers capture semantic context. Experiments on KITTI and BDD100K datasets confirm that this yields higher error-prediction performance across several detector architectures. The result supports safer deployment through more reliable run-time failure detection that can trigger fallbacks.

Core claim

LFA is a lightweight introspection module that inserts an attention mechanism over multiple backbone layers of a detector. The mechanism learns to weight layer features so that the combined representation better indicates upcoming detection errors. Because errors manifest differently at different abstraction levels, the learned weights improve both prediction accuracy and interpretability of which layers matter for failure cases. The method is evaluated end-to-end on standard driving benchmarks and shown to exceed single-layer baselines without architecture-specific redesign.

What carries the argument

Layer Feature Attention (LFA): an attention mechanism that computes learned importance weights to aggregate features from multiple backbone layers for failure prediction.
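A minimal sketch of the aggregation idea follows, in plain Python. It is not the authors' implementation: the paper excerpt describes a transformer attention mechanism over GAP-pooled layer vectors, whereas this sketch reduces that to a single dot-product query with a softmax over layers. The function names, the `query` vector (standing in for a learned parameter), and the assumption that every layer vector has already been projected to a common dimension are all hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Average a C x H x W feature map (nested lists) over its spatial axes."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_vectors, query):
    """Fuse per-layer vectors with softmax attention weights.

    Assumes every vector was already projected to the same dimension as
    `query`; `query` stands in for a learned parameter (hypothetical here).
    """
    dim = len(query)
    scores = [
        sum(q * v for q, v in zip(query, vec)) / math.sqrt(dim)
        for vec in layer_vectors
    ]
    weights = softmax(scores)
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]
    return fused, weights


# Toy inputs: three "layers", each already pooled and projected to 2-D.
layer_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
fused, weights = aggregate_layers(layer_vectors, query)
print("layer weights:", [round(w, 3) for w in weights])
print("fused vector:", [round(x, 3) for x in fused])
```

The returned `weights` sum to one across layers, which is what would make them readable as a per-layer importance signal; a real module would feed `fused` into a small classifier that outputs the frame-level failure score.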

If this is right

LFA enables more accurate triggering of fallback mechanisms or operator alerts during automated driving.
The learned attention weights provide an interpretable signal showing which feature levels best indicate detector failures.
The same lightweight module applies across multiple detector backbones without requiring per-architecture redesign.
Performance gains are demonstrated on both KITTI and BDD100K benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The layer-weighting idea could be tested on related perception tasks such as semantic segmentation or depth estimation in the same driving context.
If the attention weights consistently down-weight certain layers on particular error types, those patterns could guide targeted improvements to detector training or architecture.
Integration of LFA outputs with downstream planning modules might allow the vehicle to adjust its risk model dynamically based on predicted detector reliability.

Load-bearing premise

Detection errors appear in distinct ways across the feature hierarchy, so combining low-level detail layers with high-level semantic layers improves failure prediction over using any single layer.

What would settle it

Run LFA and the single-layer baseline on a new driving dataset or detector and measure whether the multi-layer attention version still produces higher precision-recall or AUC for failure prediction; equal or lower performance would falsify the performance claim.
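The comparison described above can be sketched as follows, assuming frame-level binary labels (1 = the detector failed on that frame) and one failure score per frame from each introspector. The `auroc` helper is a plain pairwise implementation written for illustration, and the score lists are made-up toy values, not results from the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    labels: 1 = frame where the detector failed, 0 = frame where it did not.
    scores: predicted failure score per frame (higher = more likely to fail).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one failure and one non-failure frame")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up toy scores for two hypothetical introspectors; NOT results from the paper.
labels = [1, 1, 0, 0, 1, 0]
scores_a = [0.9, 0.8, 0.3, 0.4, 0.7, 0.2]
scores_b = [0.9, 0.4, 0.3, 0.5, 0.7, 0.2]
print("AUROC A:", auroc(labels, scores_a))
print("AUROC B:", auroc(labels, scores_b))
```

Running both introspectors on the same held-out frames and comparing the two values is the core of the test; the pairwise loop is quadratic in the number of frames, so a rank-based implementation would be preferable on full benchmark splits.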

Original abstract

Reliable object detection is critical for automated driving, yet even state-of-the-art detectors inevitably make errors that can compromise safety. Introspection methods that predict detector failures enable safer deployment by triggering fallback mechanisms or alerting human operators. However, existing approaches rely solely on last-layer features or hand-crafted statistics, discarding valuable information from earlier layers that capture different levels of visual abstraction. We propose Layer Feature Attention (LFA), a lightweight introspection method that learns to aggregate features from multiple backbone layers through an attention mechanism. Our key insight is that detection errors manifest differently across feature hierarchies: low-level layers capture fine-grained details essential for detecting small or occluded objects, while high-level layers encode semantic information for scene understanding. LFA learns layer importance weights end-to-end, enabling both improved error prediction and interpretable analysis of which feature levels are most indicative of detector failures. Extensive experiments on KITTI and BDD100K demonstrate that LFA achieves state-of-the-art introspection performance, outperforming single-layer baselines across multiple detector architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LFA is a straightforward attention-based extension for multi-layer introspection that could matter for AV safety if the gains over single-layer baselines prove consistent in the experiments.

The main thing here is that LFA learns attention weights to combine features from several backbone layers instead of sticking to the last layer or hand-crafted stats for predicting when object detectors fail.

What is new is the end-to-end learned aggregation across the feature hierarchy, plus the claim that this gives both better error prediction and some insight into which layers matter for different failure types. The paper tests this on KITTI and BDD100K with multiple detector architectures, which is a reasonable setup for the automated driving use case, and the lightweight design is a practical plus.

The soft spots are mostly around the strength of the evidence. The abstract makes SOTA claims without numbers or ablation details, so the full paper needs to show clear, reproducible improvements and controls to back that up. The motivating idea that low-level layers catch fine details while high-level ones handle semantics is plausible, but it remains a hypothesis until the results demonstrate that the attention mechanism actually exploits those differences rather than just adding flexibility.

This paper is for researchers working on reliable perception and failure prediction in self-driving systems. A reader focused on practical methods for monitoring detector reliability would get usable ideas from it.

I would send it to peer review. The core approach is coherent and the application area is important enough that referees can sort out whether the gains are meaningful.

Referee Report

0 major / 3 minor

Summary. The paper introduces Layer Feature Attention (LFA), a method that employs an attention mechanism to aggregate features from multiple layers of the backbone network in 2D object detectors. This is used for run-time introspection to predict detection errors. The key claim is that LFA achieves state-of-the-art performance on the KITTI and BDD100K datasets, outperforming single-layer baselines across multiple detector architectures, while also enabling interpretable analysis of layer importance based on the hierarchical manifestation of detection errors.

Significance. If the results are confirmed, this contribution is significant for the field of automated driving as it provides a practical, lightweight approach to improve the reliability of object detection systems by utilizing multi-layer feature information that is typically discarded. The end-to-end learning of layer weights and the focus on interpretability are notable strengths that could aid in understanding and mitigating detector failures.

minor comments (3)

[Abstract] The abstract states that LFA achieves state-of-the-art introspection performance but does not provide any quantitative metrics, specific improvement values, or details on the experimental setup, which makes it challenging to evaluate the strength of the claims.
[§3.1] The motivation for using attention over simple concatenation or averaging of layers could be strengthened with a brief comparison or reference to related multi-layer fusion techniques in the literature.
[Experiments] Ensure that the implementation details, such as the exact backbone layers used and the attention module architecture, are fully specified to allow reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript, recognition of its significance for automated driving, and recommendation for minor revision. We appreciate the acknowledgment of LFA's practical approach, end-to-end learning of layer weights, and focus on interpretability.

Circularity Check

0 steps flagged

No significant circularity; method and claims are empirically grounded

The paper introduces LFA as an attention-based aggregation of multi-layer backbone features for failure prediction. The central claim rests on end-to-end training and direct empirical comparison against single-layer baselines on KITTI and BDD100K. The stated insight on hierarchical error manifestations functions as a motivating hypothesis, not an axiom embedded in the architecture or required for correctness. No equations, fitted parameters renamed as predictions, self-citation load-bearing steps, or ansatz smuggling appear in the provided text. The derivation chain is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5709 in / 967 out tokens · 24661 ms · 2026-06-28T22:27:41.517133+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 4 canonical work pages · 2 internal anchors

[1]

LFA: Layer Feature Attention for Run-Time Introspection of 2D Object Detectors in Automated Driving

INTRODUCTION Accurate perception of the surrounding environment is of paramount importance for the safe operation of automated driving (AD) systems [1]. Within the perception stack, object detection provides instance-level information by identifying and localizing traffic participants such as vehicles, pedestrians, and cyclists. Despite substantial prog...

[2]

2.1) and situate our approach within feature-based introspection methods (Sec

RELATED WORK We review introspection methods for object detection (Sec. 2.1) and situate our approach within feature-based introspection methods (Sec. 2.2). 2.1. Introspection for Object Detection Introspection methods for object detection in AD can be broadly categorized according to the type of information they exploit. Confidence-based approaches lever...
[3]

to suppress less informative activations within the extracted layer, thereby improving discriminability. In the context of LiDAR-based 3D object detection, a recent extension [17] investigated the role of activations from different backbone layers and proposed concatenating early, intermediate, and final layer features for introspection. While this ...
[4]

3.1), and describe the introspection framework for its training and evaluation (Sec

METHODOLOGY We introduce Layer Feature Attention (LFA) (Sec. 3.1), and describe the introspection framework for its training and evaluation (Sec. 3.2). 3.1. Layer Feature Attention LFA takes GAP-pooled feature vectors from all backbone layers and learns to aggregate them via a transformer attention mechanism for frame-level error prediction. Layer Pro...

[5]

EXPERIMENTS 4.1. Experimental Setup Datasets. We evaluate our approach on two autonomous driving benchmarks. KITTI [2] provides 7,481 labeled urban driving images with 2D bounding box annotations; since the official test set labels are not publicly available, we follow [11] and partition the labeled set into 60%/20%/20% splits for training, validation, a...

[6]

CONCLUSION We presented Layer Feature Attention (LFA), an introspection method that aggregates features from multiple backbone layers via learned attention to predict object detection errors at the frame level. Unlike prior approaches that rely on a single layer or hand-crafted preprocessing, LFA learns to adaptively weight layer contributions, enabli...
[7]

A survey of autonomous driving: Common practices and emerging technologies,

Ekim Yurtsever et al., “A survey of autonomous driving: Common practices and emerging technologies,” IEEE Access, vol. 8, pp. 58443–58469, 2020

2020
[8]

Are we ready for autonomous driving? the kitti vision benchmark suite,

Andreas Geiger et al., “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3354–3361

2012
[9]

Bdd100k: A diverse driving dataset for heterogeneous multitask learning,

Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell, “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2636–2645

2020
[10]

Benchmarking vision foundation models for input monitoring in autonomous driving,

Mert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, and Matthias Rottmann, “Benchmarking vision foundation models for input monitoring in autonomous driving,” in Proceedings of the British Machine Vision Conference (BMVC). 2025, BMVA Press

2025
[11]

What does really count? estimating relevance of corner cases for semantic segmentation in automated driving,

Jasmin Breitenstein et al., “What does really count? estimating relevance of corner cases for semantic segmentation in automated driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3991–4000

2023
[12]

Run-time monitoring of machine learning for robotic perception: A survey of emerging trends,

Quazi Marufur Rahman et al., “Run-time monitoring of machine learning for robotic perception: A survey of emerging trends,” IEEE Access, vol. 9, pp. 20067–20075, 2021

2021
[13]

Artificial Intelligence Act (Regulation (EU) 2024/1689) laying down harmonised rules on artificial intelligence,

“Artificial Intelligence Act (Regulation (EU) 2024/1689) laying down harmonised rules on artificial intelligence,” https://eur-lex.europa.eu/eli/reg/2024/1689/oj, June 2024, Regulation of the European Parliament and of the Council of 13 June 2024 (EU AI Act)

2024
[14]

Road vehicles — safety and artificial intelligence,

“Road vehicles — safety and artificial intelligence,” Dec. 2024, Publicly Available Specification (PAS)

2024
[15]

Introspection of dnn-based perception functions in automated driving systems: State-of-the-art and open research challenges,

Hakan Yekta Yatbaz et al., “Introspection of dnn-based perception functions in automated driving systems: State-of-the-art and open research challenges,” IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 2, pp. 1112–1130, 2023

2023
[16]

Dropout sampling for robust object detection in open-set conditions,

Dimity Miller et al., “Dropout sampling for robust object detection in open-set conditions,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 3243–3249

2018
[17]

Run-time introspection of 2d object detection in automated driving systems using learning representations,

Hakan Yekta Yatbaz et al., “Run-time introspection of 2d object detection in automated driving systems using learning representations,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 6, pp. 5033–5046, 2024

2024
[18]

Per-frame map prediction for continuous performance monitoring of object detection during deployment,

Quazi Marufur Rahman et al., “Per-frame map prediction for continuous performance monitoring of object detection during deployment,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 152–160

2021
[19]

Bayesod: A bayesian approach for uncertainty estimation in deep object detectors,

Ali Harakeh et al., “Bayesod: A bayesian approach for uncertainty estimation in deep object detectors,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 87–93

2020
[20]

Failing to learn: Autonomously identifying perception failures for self-driving cars,

Manikandasriram Srinivasan Ramanagopal et al., “Failing to learn: Autonomously identifying perception failures for self-driving cars,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3860–3867, 2018

2018
[21]

Interpretable model-agnostic plausibility verification for 2d object detectors using domain-invariant concept bottleneck models,

Mert Keser et al., “Interpretable model-agnostic plausibility verification for 2d object detectors using domain-invariant concept bottleneck models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3891–3900

2023
[22]

Extremely simple activation shaping for out-of-distribution detection,

Andrija Djurisic et al., “Extremely simple activation shaping for out-of-distribution detection,” arXiv preprint arXiv:2209.09858, 2022

work page arXiv 2022
[23]

Multi-layer self-assessment with filtering for 3d object detection in autonomous vehicles,

Hakan Yekta Yatbaz et al., “Multi-layer self-assessment with filtering for 3d object detection in autonomous vehicles,” ACM Transactions on Intelligent Systems and Technology, vol. 17, no. 1, pp. 1–23, 2025

2025
[24]

Deep residual learning for image recognition,

Kaiming He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

2016
[25]

Layer Normalization

Jimmy Lei Ba et al., “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[26]

Faster r-cnn: Towards real-time object detection with region proposal networks,

Shaoqing Ren et al., “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015

2015
[27]

End-to-end object detection with transformers,

Nicolas Carion et al., “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229

2020
