Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information

Injung Kim; Jinyoung Choi; Youngchae Kwon

arxiv: 2512.23221 · v1 · submitted 2025-12-29 · 💻 cs.CV · cs.AI

Youngchae Kwon, Jinyoung Choi, Injung Kim

Pith reviewed 2026-05-16 19:51 UTC · model grok-4.3

keywords fashion item detection · DETR · contextual information · object detection · co-occurrence · spatial relationships · body keypoints · holistic detection

The pith

Holi-DETR lifts fashion item detection accuracy by folding co-occurrence, spatial layout, and body-keypoint signals into the DETR framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard object detectors miss relationships among clothing pieces in outfit photos, so they confuse similar items. Holi-DETR adds three explicit signals: how often items appear together, their typical relative sizes and positions, and where they sit relative to body keypoints. These signals are merged directly into the transformer layers so the model reasons about the whole outfit at once. The result is a measurable rise in average precision on fashion datasets. Readers care because reliable item detection underpins virtual try-on, search, and inventory tools that currently struggle with visual ambiguity.

Core claim

Holi-DETR detects fashion items holistically by embedding three contextual signals into the DETR architecture: co-occurrence probabilities between items, relative position and size from inter-item arrangements, and spatial links between items and human body keypoints. This integration reduces subcategory confusion that independent per-item detectors cannot resolve. Experiments report gains of 3.6 percentage points AP over vanilla DETR and 1.1 points over Co-DETR on the evaluated fashion data.
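
The first signal, co-occurrence between items, can be illustrated with a small counting routine. This is a minimal sketch, assuming the statistic is a conditional probability estimated from per-image category sets. The paper's exact definition is not given on this page, and the function name `cooccurrence_probs` and the toy outfit labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probs(image_labels):
    """Estimate P(b present | a present) from per-image category lists.

    Assumption: each image contributes at most once per category pair,
    so repeated instances of one category in an image are collapsed.
    """
    single = Counter()  # number of images containing each category
    pair = Counter()    # number of images containing both categories
    for labels in image_labels:
        cats = set(labels)
        single.update(cats)
        for a, b in combinations(sorted(cats), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    # Conditional probability: images with both / images with the first.
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


# Toy annotations (made up for illustration, not from any dataset).
outfits = [
    ["blazer", "shirt", "trousers"],
    ["blazer", "trousers"],
    ["t-shirt", "shorts"],
    ["shirt", "trousers"],
]
probs = cooccurrence_probs(outfits)
print(probs[("blazer", "trousers")])   # trousers given blazer
print(probs[("trousers", "blazer")])   # blazer given trousers
```

The table is asymmetric by construction: in the toy data every blazer image also contains trousers, but not every trousers image contains a blazer. Pairs that never appear together are simply absent from the dictionary.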

What carries the argument

The Holi-DETR architecture that fuses co-occurrence, inter-item spatial, and item-to-body-keypoint contextual features into the DETR detection pipeline.

If this is right

Fashion detectors become less prone to subcategory swaps when items share visual traits.
Outfit-level consistency improves because the model respects typical spatial and co-occurrence patterns.
The same context modules apply to both DETR and successors such as Co-DETR; the abstract does not say whether existing checkpoints can be upgraded without full retraining.
Body-keypoint alignment supplies an extra cue that pure appearance-based methods lack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-signal pattern could transfer to scene understanding tasks where objects have stable spatial and co-occurrence rules with people.
If the context modules prove robust, they might lower the labeled-data requirement for new visual domains.
Performance on images with unusual body poses or non-standard outfits would expose the limits of the learned relationships.

Load-bearing premise

The three contextual signals can be added to DETR without creating dataset-specific biases or demanding retuning that breaks on other image types.

What would settle it

Apply the same architecture and signals to a standard object-detection benchmark outside fashion and measure whether average precision rises, stays flat, or drops.

Original abstract

Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percent points (pp) and 1.1 pp, respectively, in terms of average precision (AP).
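
Signals (2) and (3) in the abstract are geometric, and a short example makes them concrete. The sketch below assumes boxes in normalized `(cx, cy, w, h)` format and uses size-normalized center offsets plus log size ratios, an encoding familiar from relation-style detectors. Whether Holi-DETR uses this exact parameterization is not stated here, and the names `relative_box_features` and `keypoint_offsets` are hypothetical.

```python
import math


def relative_box_features(box_a, box_b):
    """Describe box_b relative to box_a; both are (cx, cy, w, h).

    Returns (dx, dy, log_w_ratio, log_h_ratio), where the center offset
    is divided by box_a's width and height so it is scale-invariant.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return (
        (bx - ax) / aw,
        (by - ay) / ah,
        math.log(bw / aw),
        math.log(bh / ah),
    )


def keypoint_offsets(box, keypoints):
    """Offset of each body keypoint from the box center, in box units."""
    cx, cy, w, h = box
    return {
        name: ((kx - cx) / w, (ky - cy) / h)
        for name, (kx, ky) in keypoints.items()
    }


# Toy values in normalized image coordinates (illustrative only).
top = (0.5, 0.3, 0.4, 0.3)      # an upper-body garment
bottom = (0.5, 0.7, 0.4, 0.6)   # a lower-body garment
feats = relative_box_features(top, bottom)
offsets = keypoint_offsets(top, {"left_shoulder": (0.35, 0.2)})
print(feats)
print(offsets)
```

In the toy case the lower garment sits directly below the upper one (zero horizontal offset, positive vertical offset) and is about twice as tall, which is the kind of regularity the inter-item signal is meant to capture.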

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Holi-DETR folds co-occurrence, spatial layout, and body-keypoint signals into DETR and records small but consistent AP gains on fashion images, backed by ablations.

The paper's main point is straightforward: adding three specific context types to DETR improves fashion item detection by 3.6 pp AP over vanilla DETR and 1.1 pp over Co-DETR. The authors inject context embeddings into the decoder queries and cross-attention, and they run controlled comparisons that keep the backbone and training protocol fixed across baselines. Ablations that isolate each context source (co-occurrence, inter-item spatial, item-to-keypoint) make it possible to see which signals drive the lift. That setup is cleaner than many DETR variants I have seen lately. The numbers line up internally and the integration description is detailed enough for someone to reimplement it.

The gains are modest, which fits the fact that DETR already models a lot of relational structure; the extra context mostly cleans up ambiguities that are common in outfit photos. The clearest limitation is the narrow test domain. Everything is evaluated on fashion images, so it is unclear how much of the fusion would transfer to general scenes or other relational detection problems. No error analysis or cross-dataset results are mentioned, which leaves the generalization question open.

This is useful work for groups already focused on fashion retrieval or fine-grained detection who need a stronger starting point. It is not a theoretical advance, but the experimental controls are solid enough that a referee could give targeted feedback on scaling the context fusion and on whether the same signals help outside fashion. I would send it to peer review.
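
To make "injecting context embeddings into the decoder queries" concrete, here is one of the simplest forms such an injection could take: project the context vector to the query dimension and add it. This is only an assumed, illustrative fusion rule in plain Python. The page does not specify the paper's fusion operator, and `project`, `fuse_query`, and the weight values are hypothetical.

```python
def project(weight, vector):
    """Matrix-vector product; weight is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def fuse_query(query, context, weight):
    """Additive fusion: query + W @ context.

    Assumption: context is projected to the query dimension by a linear
    map (learned in a real model, fixed here) and added element-wise.
    """
    projected = project(weight, context)
    if len(projected) != len(query):
        raise ValueError("projection must match the query dimension")
    return [q + p for q, p in zip(query, projected)]


# Toy sizes: a 3-d query and a 2-d context vector (illustrative only).
query = [1.0, 0.0, -1.0]
context = [0.5, 2.0]
weight = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
fused = fuse_query(query, context, weight)
print(fused)
```

Additive fusion keeps the query dimension unchanged, so the rest of a DETR-style decoder would not need to change shape. Concatenation or attention biasing are equally plausible alternatives.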

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Holi-DETR, a Detection Transformer variant that incorporates three contextual signals—co-occurrence probabilities between fashion items, inter-item spatial relations (relative position and size), and item-to-human-keypoint spatial relations—into the decoder queries and cross-attention to enable holistic detection of fashion items in outfit images. The central claim is that this integration produces AP gains of 3.6 percentage points over vanilla DETR and 1.1 percentage points over Co-DETR under controlled comparisons using the same backbone and training protocol, supported by ablations isolating each context type.

Significance. If the gains are reproducible, the work supplies concrete evidence that domain-specific contextual priors can improve transformer detectors on structured scenes without altering the core architecture. The controlled experimental design, including ablations that isolate each of the three signals and direct comparisons against both vanilla DETR and Co-DETR, strengthens the attribution of improvements to the added context rather than training artifacts.

minor comments (3)

[Abstract] The abstract reports specific AP deltas but omits the dataset name, size, and class count; moving these statistics to the abstract or the first paragraph of the introduction would improve immediate readability.
[Method] A single diagram showing the precise injection points of the three context embeddings into the DETR decoder (queries and cross-attention) would clarify the architecture description and aid reproduction.
[Experiments] The experimental section should report whether the AP gains are averaged over multiple random seeds or include standard deviations to confirm stability.
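
The third comment can be acted on with very little code. A minimal sketch, assuming per-seed AP values are available for both the baseline and the proposed model. The numbers below are placeholders invented for illustration and are not results from the paper, and the two-standard-deviation threshold is an arbitrary rule of thumb rather than a formal significance test.

```python
import statistics


def summarize_runs(ap_by_seed):
    """Return (mean, sample standard deviation) of AP over seeds."""
    values = list(ap_by_seed.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def gain_exceeds_noise(baseline, proposed, k=2.0):
    """Compare the mean AP gain with k times the larger per-model std."""
    b_mean, b_std = summarize_runs(baseline)
    p_mean, p_std = summarize_runs(proposed)
    gain = p_mean - b_mean
    return gain, gain > k * max(b_std, p_std)


# Placeholder AP values keyed by random seed (NOT from the paper).
baseline_ap = {0: 40.0, 1: 41.0, 2: 42.0}
proposed_ap = {0: 44.0, 1: 45.0, 2: 46.0}
gain, stable = gain_exceeds_noise(baseline_ap, proposed_ap)
print(gain, stable)
```

Reporting the mean and standard deviation next to each AP figure would let readers judge whether a 1.1 pp gain over a strong baseline lies outside seed-to-seed variation.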

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of Holi-DETR, the accurate summary of our contributions, and the recommendation for minor revision. The controlled comparisons and ablations isolating each contextual signal are indeed central to attributing the reported AP gains. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's core contribution is an architectural modification to DETR that injects three heterogeneous context embeddings (co-occurrence probabilities, inter-item spatial relations, and item-to-keypoint relations) into the decoder queries and cross-attention layers. All reported results are empirical AP gains measured on held-out test splits against external baselines (vanilla DETR and Co-DETR) under identical backbone and training protocols, with ablations that isolate each added signal. No equation, parameter, or performance metric is defined in terms of itself or recovered by construction from the same fitted values; the derivation chain consists of standard transformer components plus explicit context encoders whose outputs are independent of the final detection scores.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard DETR transformer assumptions plus the domain premise that fashion items exhibit stable co-occurrence and spatial patterns; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

[standard math] DETR architecture and training assumptions hold for the augmented model. The paper extends DETR without altering its core mathematical formulation.
[domain assumption] Fashion images contain reliable co-occurrence and spatial relationships usable as context. This is the central premise stated in the abstract for why context helps.


[1] [1]

Expert Systems with Applications 116, 328 – 339 (2019)

Yian, S., Kyungshik, S.: Hierarchical convolutional neural networks for fashion image classification. Expert systems with applications116, 328–339 (2019) https: //doi.org/10.1016/j.eswa.2018.09.022

work page doi:10.1016/j.eswa.2018.09.022 2019

[2] [2]

Multimedia Tools and Applications82(5), 7383–7400 (2023) https://doi.org/10.1007/s11042-022-13424-8

Tian, Q., Chanda, S., Gray, D.: Improving apparel detection with category grouping and multi-grained branches. Multimedia Tools and Applications82(5), 7383–7400 (2023) https://doi.org/10.1007/s11042-022-13424-8

work page doi:10.1007/s11042-022-13424-8 2023

[3] [3]

Preprint at https://arxiv.org/abs/2111.00905 15 (2021)

Mohammadi, S.O., Kalhor, A.: Smart fashion: a review of AI applications in the Fashion & Apparel Industry. Preprint at https://arxiv.org/abs/2111.00905 15 (2021)

work page arXiv 2021

[4] [4]

Sensors23(13), 6083 (2023) https://doi.org/10.3390/s23136083

Ma, B., Xu, W.: Efficient fine tuning for fashion object detection. Sensors23(13), 6083 (2023) https://doi.org/10.3390/s23136083

work page doi:10.3390/s23136083 2023

[5] [5]

In: The World Wide Web Conference, pp

Cui, Z., Li, Z., Wu, S., Zhang, X.-Y., Wang, L.: Dressing as a whole: Outfit com- patibility learning based on node-wise graph neural networks. In: The World Wide Web Conference, pp. 307–317 (2019). https://doi.org/10.1145/3308558.3313444

work page doi:10.1145/3308558.3313444 2019

[6] [6]

Schmon, and Chris G

Sarkar, R., Bodla, N., Vasileva, M., Lin, Y.-L., Beniwal, A., Lu, A., Medioni, G.: Outfittransformer: Outfit representations for fashion recommendation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2263–2267 (2022). https://doi.org/10.1109/cvprw56347.2022. 00249

work page doi:10.1109/cvprw56347.2022 2022

[7] [7]

In: Proceedings of the IEEE/CVF Conference on Computer 25 Vision and Pattern Recognition, pp

Lin, Y.-L., Tran, S., Davis, L.S.: Fashion outfit complementary item retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pp. 3311–3319 (2020). https://doi.org/10.1109/cvpr42600.2020. 00337

work page doi:10.1109/cvpr42600.2020 2020

[8] [8]

MoCoGAN: Decomposing motion and content for video generation

Han, X., Wu, Z., Wu, Z., Yu, R., Davis, L.S.: Viton: An image-based virtual try- on network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7543–7552 (2018). https://doi.org/10.1109/cvpr.2018. 00787

work page doi:10.1109/cvpr.2018 2018

[9] [9]

Wheat, M

Islam, T., Miron, A., Liu, X., Li, Y.: Deep learning in virtual try-on: A comprehensive survey. IEEE Access (2024) https://doi.org/10.1109/access.2024. 3368612

work page doi:10.1109/access.2024 2024

[10]

Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014). https://doi.org/10.1109/CVPR.2014.81

[11]

Girshick, R.: Fast r-cnn. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448 (2015). https://doi.org/10.1109/ICCV.2015.169

[12]

Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39(6), 1137–1149 (2016) https://doi.org/10.1109/TPAMI.2016.2577031

[13]

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: Ssd: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp. 21–37 (2016). https://doi.org/10.1007/978-3-319-46448-0_2 . Springer

[14]

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229 (2020). https://doi.org/10.1007/978-3-030-58452-8_13 . Springer

[15]

Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=gZ9hCDWe6ke

[16]

Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J.: Conditional detr for fast training convergence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3651–3660 (2021). https://doi.org/10.1109/ICCV48922.2021.00363

[17]

Tian, Z., Chu, X., Wang, X., Wei, X., Shen, C.: Fully convolutional one-stage 3d object detection on lidar range images. Advances in Neural Information Processing Systems 35, 34899–34911 (2022) https://doi.org/10.48550/arXiv.2205.13764

[18]

Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: Dn-detr: Accelerate detr training by introducing query denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13619–13627 (2022). https://doi.org/10.1109/tpami.2023.3335410

[19]

Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L.: DAB-DETR: Dynamic anchor boxes are better queries for DETR. In: International Conference on Learning Representations (2022). https://doi.org/10.48550/arXiv.2201.12329 . https://openreview.net/forum?id=oMI9PjOb9Jl

[20]

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.-Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. In: The Eleventh International Conference on Learning Representations (2022). https://doi.org/10.48550/arXiv.2203.03605

[21]

Shehzadi, T., Hashmi, K.A., Stricker, D., Afzal, M.Z.: 2d object detection with transformers: a review. Preprint at https://arxiv.org/abs/2306.04670 (2023). https://doi.org/10.48550/arXiv.2306.04670

[22]

Hou, X., Liu, M., Zhang, S., Wei, P., Chen, B., Lan, X.: Relation detr: Exploring explicit position relation prior for object detection. In: European Conference on Computer Vision, pp. 89–105 (2025). https://doi.org/10.1007/978-3-031-72973-7_6 . Springer

[23]

Lao, B., Jagadeesh, K.: Convolutional neural networks for fashion classification and object detection. CCCV 2015 Comput. Vis 546, 120–129 (2015)

[24]

Feng, Z., Luo, X., Yang, T., Kita, K.: An object detection system based on yolov2 in fashion apparel. In: 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 1532–1536 (2018). https://doi.org/10.1109/compcomm.2018.8780944 . IEEE

[25]

Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: Centernet: Keypoint triplets for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6569–6578 (2019). https://doi.org/10.1109/ICCV.2019.00667

[26]

Kim, H.J., Lee, D.H., Niaz, A., Kim, C.Y., Memon, A.A., Choi, K.N.: Multiple-clothing detection and fashion landmark estimation using a single-stage detector. IEEE Access 9, 11694–11704 (2021) https://doi.org/10.1109/access.2021.3051424

[27]

Lee, C.-H., Lin, C.-W.: A two-phase fashion apparel detection method based on yolov4. Applied Sciences 11(9), 3782 (2021) https://doi.org/10.3390/app11093782

[28]

Sidnev, A., Krapivin, A., Trushkov, A., Krasikova, E., Kazakov, M., Viryasov, M.: Deepmark++: Real-time clothing detection at the edge. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2980–2988 (2021). https://doi.org/10.1109/wacv48630.2021.00302

[29]

Alamsyah, A., Saputra, M.A.A., Masrury, R.A.: Object detection using convolutional neural network to identify popular fashion product. In: Journal of Physics: Conference Series, vol. 1192, p. 012040 (2019). https://doi.org/10.1088/1742-6596/1192/1/012040 . IOP Publishing

[30]

Li, Y., Zhang, W., Wu, M., Zhang, D., Wang, Z., You, C.: Multi-keypoints matching network for clothing detection. The Visual Computer, 1–13 (2024) https://doi.org/10.1007/s00371-024-03337-y

[31]

Hara, K., Jagadeesh, V., Piramuthu, R.: Fashion apparel detection: the role of deep convolutional neural network and pose-dependent priors. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9 (2016). https://doi.org/10.1109/WACV.2016.7477611 . IEEE

[32]

Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context. In: 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8 (2007). https://doi.org/10.1109/ICCV.2007.4408986 . IEEE

[34]

Zong, Z., Song, G., Liu, Y.: Detrs with collaborative hybrid assignments training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6748–6758 (2023). https://doi.org/10.1109/iccv51070.2023.00621

[35]

Redmon, J.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016). https://doi.org/10.1109/CVPR.2016.91

[36]

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy

[37]

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021). https://doi.org/10.1109/ICCV48922.2021.00986

[38]

Cinbis, R.G., Sclaroff, S.: Contextual object detection using set-based classification. In: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12, pp. 43–57 (2012). https://doi.org/10.1007/978-3-642-33783-3_4 . Springer

[39]

Alamri, F., Pugeault, N.: Improving object detection performance using scene contextual constraints. IEEE Transactions on Cognitive and Developmental Systems 14(4), 1320–1330 (2020) https://doi.org/10.1109/TCDS.2020.3008213

[40]

Galleguillos, C., Belongie, S.: Context based object categorization: A critical survey. Computer vision and image understanding 114(6), 712–722 (2010) https://doi.org/10.1016/j.cviu.2010.02.004

[41]

Galleguillos, C., Rabinovich, A., Belongie, S.: Object categorization using co-occurrence, location and appearance. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008). https://doi.org/10.1109/CVPR.2008.4587799 . IEEE

[42]

Zolghadr, E., Furht, B.: Scene understanding using context-based conditional random field. In: WSCG 2016 - 24th Conference on Computer Graphics, Visualization and Computer Vision (2016). http://wscg.zcu.cz/WSCG2016/!!_CSRN-2601.pdf

[43]

Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597 (2018). https://doi.org/10.1109/cvpr.2018.00378

[44]

Barnea, E., Ben-Shahar, O.: Contextual object detection with a few relevant neighbors. In: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part II 14, pp. 480–495 (2019). https://doi.org/10.1007/978-3-030-20890-5_31 . Springer

[45]

Alamri, F., Pugeault, N.: Contextual relabelling of detected objects. In: 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 313–319 (2019). https://doi.org/10.1109/devlrn.2019.8850686 . IEEE

[46]

Pato, L.V., Negrinho, R., Aguiar, P.M.: Seeing without looking: Contextual rescoring of object detections for ap maximization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14610–14618 (2020). https://doi.org/10.1109/cvpr42600.2020.01462

[47]

Hao, X., Huang, D., Lin, J., Lin, C.-Y.: Relation-enhanced detr for component detection in graphic design reverse engineering. In: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 4785–4793 (2023). https://doi.org/10.24963/ijcai.2023/532

[48]

Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems 35, 38571–38584 (2022) https://doi.org/10.1109/iccc56324.2022.10065997

[49]

Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22–29 (1990) https://doi.org/10.1093/oso/9780199292332.003.0019

[50]

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: DETR (DEtection TRansformer). https://github.com/facebookresearch/detr

[51]

Sense-X: Co-DETR: Cooperative Detection Transformer. https://github.com/Sense-X/Co-DETR

[52]

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90