EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers
Pith reviewed 2026-06-28 15:22 UTC · model grok-4.3
The pith
Reformulating cross-attention in Detection Transformers produces instance-level saliency maps directly from the forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that the cross-attention mechanism in the decoder can serve as a direct instance-level feature attribution pathway. Aggregating these signals via the cross-layer hybrid consensus fusion module yields compact and stable saliency maps for each predicted instance. The resulting explanations require only the model's standard forward computation and can be enhanced during training through spatial constraints on attention.
What carries the argument
Reformulation of decoder cross-attention as instance-level feature attribution pathway, aggregated by the cross-layer hybrid consensus fusion module.
If this is right
- Explanations become available at the speed of a single model inference for both single- and multi-scale detectors.
- No additional overhead from gradients or perturbations is needed to produce instance-specific saliency.
- Joint training with attention constraints can simultaneously boost detection accuracy and explanation quality.
- The framework generalizes across multiple DETR variants and datasets including MS COCO, ExDark, and Cityscapes.
Where Pith is reading between the lines
- Real-time applications could integrate these explanations without sacrificing throughput.
- Similar attention-reformulation ideas might extend to other transformer-based vision tasks beyond detection.
- The method highlights a path toward inherently interpretable detection models rather than relying on external explainers.
Load-bearing premise
That the cross-attention weights in the decoder precisely capture which image regions drive each instance prediction.
What would settle it
An experiment showing that occluding the high-attention regions identified by EIVE fails to reduce the model's confidence in the corresponding detection, unlike established gradient-based explanations.
Figures
read the original abstract
Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EIVE, an end-to-end framework for instance-specific visual explanations in DETR-like detectors. It reformulates decoder cross-attention as an instance-level feature attribution pathway, introduces a cross-layer hybrid consensus fusion (CLHCF) module to aggregate attention across layers, and an attention-aware joint training strategy (AAJTS) to impose spatial constraints during training. The method claims to generate saliency maps directly after the forward pass without gradients or perturbations, achieving high-quality explanations with efficiency gains and detection performance comparable or superior to post-hoc methods on MS COCO 2017, ExDark, and Cityscapes.
Significance. If the central reformulation holds and the empirical claims are substantiated with detailed metrics, this would represent a meaningful contribution by providing an efficient, integrated explanation approach for multi-instance detection that can also improve model performance via joint training. The open-sourced code strengthens reproducibility.
major comments (3)
- [Abstract] Abstract: The core assertion that 'the cross-attention of each object query corresponds to the visual attribution of its predicted instance' is presented as a direct architectural consequence without a faithfulness argument, causal intervention, or independent justification. This is load-bearing for the end-to-end claim.
- [Abstract] Abstract: The subsequent introduction of AAJTS 'to impose spatial constraints on cross-attention patterns' indicates that raw cross-attention does not reliably yield concentrated, instance-specific maps, which undercuts the claim that the reformulation alone suffices as an attribution pathway.
- [Abstract] Abstract: Claims of 'high-quality instance-level saliency maps' and 'performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics' are unsupported by any quantitative results, baselines, ablation studies, or specific metric values in the provided text.
minor comments (2)
- [Abstract] The abstract refers to 'standard metrics' without enumerating them (e.g., insertion/deletion curves, pointing game, or IoU-based scores).
- [Abstract] The invented modules CLHCF and AAJTS are introduced without prior references or motivation from existing attention-fusion literature.
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments on our submission. We address each major comment point-by-point below, clarifying the manuscript's content and indicating revisions where appropriate to improve clarity without altering core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The core assertion that 'the cross-attention of each object query corresponds to the visual attribution of its predicted instance' is presented as a direct architectural consequence without a faithfulness argument, causal intervention, or independent justification. This is load-bearing for the end-to-end claim.
Authors: The correspondence follows directly from the DETR decoder design, in which each object query is optimized to localize and classify one instance and its cross-attention therefore aggregates features belonging to that instance. Section 3.1 formalizes this as an instance-level attribution pathway and contrasts it with post-hoc methods. While the abstract is necessarily concise, we agree a short parenthetical justification would strengthen the claim; we will revise the abstract and add one sentence in the introduction referencing the architectural motivation. revision: partial
-
Referee: [Abstract] Abstract: The subsequent introduction of AAJTS 'to impose spatial constraints on cross-attention patterns' indicates that raw cross-attention does not reliably yield concentrated, instance-specific maps, which undercuts the claim that the reformulation alone suffices as an attribution pathway.
Authors: AAJTS is an optional joint-training strategy that further regularizes attention for improved detection mAP and explanation stability; it is not required for the basic EIVE pipeline. The core reformulation plus CLHCF already produces usable instance-specific maps, as shown by the ablation studies in Section 4.3 (CLHCF alone vs. baseline attention). The abstract wording was ambiguous on this distinction; we will revise the abstract to state that the reformulation and fusion suffice for end-to-end explanations, with AAJTS presented separately as a training-time enhancement. revision: yes
-
Referee: [Abstract] Abstract: Claims of 'high-quality instance-level saliency maps' and 'performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics' are unsupported by any quantitative results, baselines, ablation studies, or specific metric values in the provided text.
Authors: The abstract summarizes results detailed in Sections 4 and 5 (Tables 1–4, Figures 3–6). On MS COCO, EIVE matches or exceeds Grad-CAM, D-RISE, and LRP on Insertion/Deletion AUC and pointing-game accuracy while running >100× faster; similar trends hold on ExDark and Cityscapes, with detection mAP either preserved or improved under AAJTS. Because abstracts have strict length limits, numerical values were omitted, but we can insert one representative quantitative sentence if space permits. revision: partial
Circularity Check
No significant circularity detected
full rationale
The paper presents EIVE as an architectural reformulation that treats decoder cross-attention maps as instance-level saliency by definition of the proposed framework. No equations, derivations, or fitted parameters are described that reduce the claimed saliency output or performance metrics to quantities defined by the method itself. The AAJTS regularizer and CLHCF fusion are presented as optional enhancements rather than load-bearing justifications for the core equivalence. Empirical results on MS COCO, ExDark, and Cityscapes are compared against external post-hoc baselines, providing independent evaluation. The derivation chain is therefore self-contained and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Cross-attention between object queries and image features in the DETR decoder directly corresponds to instance-level visual attribution
invented entities (2)
-
Cross-layer hybrid consensus fusion (CLHCF) module
no independent evidence
-
Attention-aware joint training strategy (AAJTS)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778
2016
-
[2]
Shufflenet: An extremely effi- cient convolutional neural network for mobile devices,
X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely effi- cient convolutional neural network for mobile devices,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6848–6856
2018
-
[3]
Mobilenetv2: Inverted residuals and linear bottlenecks,
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520
2018
-
[4]
Sbsnet: Spatial-spectral background-target separation network for hyperspectral target detection,
J. Xiang, Y . Li, L. Dai, R. Qi, H. Tang, L. Zhang, K. Zhang, and W. Xie, “Sbsnet: Spatial-spectral background-target separation network for hyperspectral target detection,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026
2026
-
[5]
Parallel rectangle flip attack: A query-based black-box attack against object detection,
S. Liang, B. Wu, Y . Fan, X. Wei, and X. Cao, “Parallel rectangle flip attack: A query-based black-box attack against object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 7697–7707
2021
-
[6]
A large-scale multiple-objective method for black-box attack against object detection,
S. Liang, L. Li, Y . Fan, X. Jia, J. Li, B. Wu, and X. Cao, “A large-scale multiple-objective method for black-box attack against object detection,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 619– 636
2022
-
[7]
A review and comparative study on probabilistic object detection in autonomous driv- ing,
D. Feng, A. Harakeh, S. L. Waslander, and K. Dietmayer, “A review and comparative study on probabilistic object detection in autonomous driv- ing,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 9961–9980, 2021
2021
-
[8]
Safe: Sensitivity-aware features for out-of-distribution object detection,
S. Wilson, T. Fischer, F. Dayoub, D. Miller, and N. S ¨underhauf, “Safe: Sensitivity-aware features for out-of-distribution object detection,” in Proceedings of the ieee/cvf international conference on computer vision, 2023, pp. 23 565–23 576
2023
-
[9]
Learning deep features for discriminative localization,
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921– 2929
2016
-
[10]
Layercam: Exploring hierarchical class activation maps for localization,
P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, and Y . Wei, “Layercam: Exploring hierarchical class activation maps for localization,”IEEE transactions on image processing, vol. 30, pp. 5875–5888, 2021
2021
-
[11]
Grad-cam: Visual explanations from deep networks via gradient-based localization,
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 618–626
2017
-
[12]
Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,
A. Chattopadhay, A. Sarkar, P. Howlader, and V . N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in2018 IEEE winter conference on applica- tions of computer vision (WACV). IEEE, 2018, pp. 839–847
2018
-
[13]
Score-cam: Score-weighted visual explanations for convo- lutional neural networks,
H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convo- lutional neural networks,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 24–25
2020
-
[14]
Spatial sensitive grad-cam++: Improved visual explana- tion for object detectors via weighted combination of gradient map,
T. Yamauchi, “Spatial sensitive grad-cam++: Improved visual explana- tion for object detectors via weighted combination of gradient map,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8164–8168
2024
-
[15]
Finer-cam: Spotting the difference reveals finer details for visual explanation,
Z. Zhang, J. Gu, A. Chowdhury, Z. Mai, D. Carlyn, T. Berger-Wolf, Y . Su, and W.-L. Chao, “Finer-cam: Spotting the difference reveals finer details for visual explanation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9611–9620
2025
-
[16]
Bi-cam: Generating explanations for deep neural networks using bipolar information,
Y . Li, H. Liang, and R. Yu, “Bi-cam: Generating explanations for deep neural networks using bipolar information,”IEEE Transactions on Multimedia, vol. 26, pp. 568–580, 2023
2023
-
[17]
Cr-cam: Generating explanations for deep neural networks by contrasting and ranking features,
Y . Li, H. Liang, H. Zheng, and R. Yu, “Cr-cam: Generating explanations for deep neural networks by contrasting and ranking features,”Pattern Recognition, vol. 149, p. 110251, 2024
2024
-
[18]
Gt-cam: Game theory based class activation map for gcn,
Y . Li, T. Shi, Z. Chen, L. Zhang, and W. Xie, “Gt-cam: Game theory based class activation map for gcn,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 8806–8819, 2024
2024
-
[19]
” why should i trust you?
M. T. Ribeiro, S. Singh, and C. Guestrin, “” why should i trust you?” explaining the predictions of any classifier,” inProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144
2016
-
[20]
RISE: Randomized Input Sampling for Explanation of Black-box Models
V . Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input sampling for explanation of black-box models,”arXiv preprint arXiv:1806.07421, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[21]
Interpretable explanations of black boxes by meaningful perturbation,
R. C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 3429–3437
2017
-
[22]
Understanding deep networks via extremal perturbations and smooth masks,
R. Fong, M. Patrick, and A. Vedaldi, “Understanding deep networks via extremal perturbations and smooth masks,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 2950– 2958
2019
-
[23]
Visualizing deep networks by optimizing with integrated gradients
Z. Qi, S. Khorram, and F. Li, “Visualizing deep networks by optimizing with integrated gradients.” inAAAI, vol. 34, 2020, pp. 11 890–11 898
2020
-
[24]
igos++ integrated gradient optimized saliency by bilateral perturbations,
S. Khorram, T. Lawson, and L. Fuxin, “igos++ integrated gradient optimized saliency by bilateral perturbations,” inProceedings of the Conference on Health, Inference, and Learning, 2021, pp. 174–182
2021
-
[25]
A unified approach to interpreting model predictions,
S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,”Advances in neural information processing systems, vol. 30, 2017
2017
-
[26]
Consistent Individualized Feature Attribution for Tree Ensembles
S. M. Lundberg, G. G. Erion, and S.-I. Lee, “Consistent individualized feature attribution for tree ensembles,”arXiv preprint arXiv:1802.03888, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[27]
Rich feature hierarchies for accurate object detection and semantic segmentation,
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587
2014
-
[28]
Fast r-cnn,
R. Girshick, “Fast r-cnn,” inProceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448
2015
-
[29]
Faster r-cnn: Towards real-time object detection with region proposal networks,
S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,”IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137–1149, 2016
2016
-
[30]
Mask r-cnn,
K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969
2017
-
[31]
You only look once: Unified, real-time object detection,
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779– 788
2016
-
[32]
Focal loss for dense object detection,
T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Doll ´ar, “Focal loss for dense object detection,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988
2017
-
[33]
Fcos: Fully convolutional one- stage object detection,
Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one- stage object detection,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636
2019
-
[34]
End-to-end object detection with transformers,
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213– 229
2020
-
[35]
Gradient-based instance-specific visual explanations for object specification and object discrimination,
C. Zhao, J. H. Hsiao, and A. B. Chan, “Gradient-based instance-specific visual explanations for object specification and object discrimination,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 9, pp. 5967–5985, 2024
2024
-
[36]
Black-box explanation of object detectors via saliency maps,
V . Petsiuk, R. Jain, V . Manjunatha, V . I. Morariu, A. Mehra, V . Ordonez, and K. Saenko, “Black-box explanation of object detectors via saliency maps,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 443–11 452. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 17
2021
-
[37]
Making sense of dependence: Efficient black-box explanations using dependence measure,
P. Novello, T. Fel, and D. Vigouroux, “Making sense of dependence: Efficient black-box explanations using dependence measure,”Advances in Neural Information Processing Systems, vol. 35, pp. 4344–4357, 2022
2022
-
[38]
Interpreting object-level foundation models via visual precision search,
R. Chen, S. Liang, J. Li, S. Liu, M. Li, Z. Huang, H. Zhang, and X. Cao, “Interpreting object-level foundation models via visual precision search,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 30 042–30 052
2025
-
[39]
Deformable DETR: Deformable Transformers for End-to-End Object Detection
X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,”arXiv preprint arXiv:2010.04159, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[40]
Conditional detr for fast training convergence,
D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y . Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3651– 3660
2021
-
[41]
Dab-detr: Dynamic anchor boxes are better queries for detr.arXiv preprint arXiv:2201.12329,
S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,”arXiv preprint arXiv:2201.12329, 2022
-
[42]
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y . Shum, “Dino: Detr with improved denoising anchor boxes for end-to- end object detection,”arXiv preprint arXiv:2203.03605, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017
2017
-
[44]
Attention is not explanation,
S. Jain and B. C. Wallace, “Attention is not explanation,” inProceedings of the 2019 Conference of the North American Chapter of the Associ- ation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3543–3556
2019
-
[45]
Quantifying attention flow in transformers,
S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,” inProceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 4190–4197
2020
-
[46]
Transformer interpretability beyond attention visualization,
H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 782–791
2021
-
[47]
Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,
H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,” inPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 397–406
2021
-
[48]
Attcat: Explaining transformers via attentive class activation tokens,
Y . Qiang, D. Pan, C. Li, X. Li, R. Jang, and D. Zhu, “Attcat: Explaining transformers via attentive class activation tokens,”Advances in neural information processing systems, vol. 35, pp. 5052–5064, 2022
2022
-
[49]
Token transformation matters: Towards faithful post-hoc explanation for vision transformer,
J. Wu, B. Duan, W. Kang, H. Tang, and Y . Yan, “Token transformation matters: Towards faithful post-hoc explanation for vision transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 926–10 935
2024
-
[50]
On the faithfulness of vision transformer explanations,
J. Wu, W. Kang, H. Tang, Y . Hong, and Y . Yan, “On the faithfulness of vision transformer explanations,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 936–10 945
2024
-
[51]
Explainability enhanced object detection transformer with feature disentanglement,
W. Yu, R. Liu, D. Chen, and Q. Hu, “Explainability enhanced object detection transformer with feature disentanglement,”IEEE Transactions on Image Processing, vol. 33, pp. 6439–6454, 2024
2024
-
[52]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[53]
Training data-efficient image transformers & distillation through attention,
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inInternational conference on machine learning. PMLR, 2021, pp. 10 347–10 357
2021
-
[54]
Swin transformer: Hierarchical vision transformer using shifted windows,
Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022
2021
-
[55]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,
W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 568–578
2021
-
[56]
Cvt: Introducing convolutions to vision transformers,
H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 22–31
2021
-
[57]
Microsoft coco: Common objects in context,
T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean conference on computer vision. Springer, 2014, pp. 740–755
2014
-
[58]
Getting to know low-light images with the exclusively dark dataset,
Y . P. Loh and C. S. Chan, “Getting to know low-light images with the exclusively dark dataset,”Computer vision and image understanding, vol. 178, pp. 30–42, 2019
2019
-
[59]
The cityscapes dataset for semantic urban scene understanding,
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be- nenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213– 3223
2016
-
[60]
Top- down neural attention by excitation backprop,
J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top- down neural attention by excitation backprop,”International Journal of Computer Vision, vol. 126, no. 10, pp. 1084–1102, 2018
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.