A document is worth a structured record: Principled inductive bias design for document recognition

Ahmed Abdulkadir; Benjamin F. Grewe; Benjamin Meyer; Daniel Schmid; Erdal Ayfer; Lukas Tuggener; Sascha Hänzi; Thilo Stadelmann

arxiv: 2507.08458 · v2 · submitted 2025-07-11 · 💻 cs.CV · cs.AI

Benjamin Meyer, Lukas Tuggener, Sascha Hänzi, Daniel Schmid, Erdal Ayfer, Benjamin F. Grewe, Ahmed Abdulkadir, Thilo Stadelmann

Pith reviewed 2026-05-19 04:54 UTC · model grok-4.3

keywords: document recognition · inductive bias · transformer architecture · engineering drawings · structured records · end-to-end models · relational biases · transcription task

The pith

Treating document recognition as transcription to structured records allows design of relational inductive biases in transformers that enable end-to-end models for complex document types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard computer vision approaches to document recognition ignore the convention-driven structures that encode precise information in many document types, forcing reliance on heuristic post-processing. Framing the task instead as transcription from document to record naturally groups documents by shared structural properties in their output, so related types can be handled and learned together. The authors propose a method to encode these structures as relational inductive biases inside a base transformer architecture and adapt the same architecture across different record structures. Experiments on monophonic sheet music, shape drawings, and simplified engineering drawings demonstrate the approach, with the key result being the first successful end-to-end transcription of mechanical engineering drawings to inherently interlinked information. A sympathetic reader would care because the method offers a systematic way to handle less frequent or more intricate document types without custom post-processing pipelines.

Core claim

The authors establish that integrating an inductive bias for unrestricted graph structures into a base transformer architecture produces the first successful end-to-end model for transcribing mechanical engineering drawings to their inherently interlinked information. This follows from designing structure-specific relational inductive biases that capture the intrinsic, convention-driven properties of each document type, allowing the same architecture to be adapted across record structures while eliminating dependence on heuristic post-processing.

What carries the argument

A base transformer architecture adapted with structure-specific relational inductive biases that encode the convention-driven structures of document types for direct transcription to structured records.

If this is right

Documents sharing similar transcription structures can be grouped and learned together within the same adapted architecture.
End-to-end recognition becomes feasible for document types whose output records contain complex interlinked information.
The same base architecture can be reused across document types by swapping only the relational inductive bias.
The design principle offers a template for building future document foundation models without type-specific post-processing.
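
One generic way to picture a relational inductive bias is as a restriction on which elements of a transformer may attend to each other. The sketch below is an illustration only: this page does not say how the paper implements its biases, so the masking approach and the names `sequence_mask`, `graph_mask`, and `masked_attention_weights` are all assumptions. It shows a single attention routine reused under two structural priors by swapping only the mask.

```python
import math


def sequence_mask(n):
    """Causal mask (assumed prior for sequence-like records): i attends to 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def graph_mask(n, edges):
    """Graph mask (assumed prior for graph-like records): i attends to itself and its neighbours."""
    mask = [[i == j for j in range(n)] for i in range(n)]
    for a, b in edges:
        mask[a][b] = True
        mask[b][a] = True  # edges are treated as undirected in this sketch
    return mask


def masked_attention_weights(scores, mask):
    """Row-wise softmax over `scores`, restricted to positions where `mask` is True."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        top = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Same attention routine, two different structural priors.
scores = [[0.0] * 4 for _ in range(4)]  # uniform scores: only the mask shapes the result
seq_weights = masked_attention_weights(scores, sequence_mask(4))
graph_weights = masked_attention_weights(scores, graph_mask(4, [(0, 3), (1, 2)]))
```

With uniform scores, the mask alone determines the attention pattern, which makes the effect of each prior easy to inspect row by row.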

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Embedding structural knowledge directly via biases could lower the amount of labeled data needed when adapting to new document types.
The same bias-design approach might extend to non-document structured transcription tasks such as scientific diagrams or circuit schematics.
Applying the method to noisy real-world scans rather than simplified drawings would test whether the biases remain effective under realistic conditions.

Load-bearing premise

The intrinsic, convention-driven structures of document types can be effectively captured as relational inductive biases inside a transformer architecture.

What would settle it

An experiment in which the adapted transformer for mechanical engineering drawings fails to output accurate interlinked records end-to-end and still requires heuristic post-processing to reach usable accuracy.
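
To make "accurate interlinked records" concrete, one could score predicted links against reference links. The sketch below is a hypothetical scoring routine, not the paper's evaluation protocol: the function `precision_recall_f1`, the triple format `(source, relation, target)`, and the example identifiers are assumptions, and it presumes that element identifiers in the prediction are already aligned with those in the reference.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 over hashable facts such as edge triples."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical links in a drawing record: (source, relation, target).
reference_edges = {
    ("d1", "measures", "l1"),
    ("l1", "tangent_to", "c1"),
    ("l2", "parallel_to", "l1"),
}
predicted_edges = {
    ("d1", "measures", "l1"),
    ("l1", "tangent_to", "c1"),
    ("l2", "tangent_to", "c1"),  # wrong relation and wrong target
}

precision, recall, f1 = precision_recall_f1(predicted_edges, reference_edges)
exact_match = predicted_edges == reference_edges
```

A record-level exact match is the stricter criterion. The link-level scores show how far a prediction is from that point and whether post-processing would still be needed to close the gap.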

Original abstract

Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, many state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific relational inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe mechanical engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reframes document recognition as image-to-structured-record transcription and builds type-specific relational biases into a transformer, with the graph bias for engineering drawings as the standout piece, though the abstract leaves the end-to-end claim and numbers unverified.

The main thing to know is that this work treats document recognition as producing a full structured record rather than loose text or symbols, then designs matching relational inductive biases for the transformer. They start with a base architecture and adapt it across record structures, showing results on music, shapes, and simplified drawings before claiming the first end-to-end success on mechanical engineering drawings via an unrestricted graph bias.

Referee Report

2 major / 2 minor

Summary. The paper frames document recognition as transcription from document to structured record, proposing a method to design structure-specific relational inductive biases for transformer models. It introduces a base architecture adaptable to different structures and demonstrates effectiveness through progressive experiments on monophonic sheet music, shape drawings, and simplified engineering drawings. The central claim is that an inductive bias for unrestricted graph structures enables the first successful end-to-end model for transcribing mechanical engineering drawings to inherently interlinked information without heuristic post-processing.

Significance. If the experimental claims are substantiated with metrics and details, this could offer a principled methodology for embedding document-type conventions as inductive biases in end-to-end systems, potentially extending modern recognition techniques to complex or infrequent document types like engineering drawings and informing the design of unified document foundation models.

major comments (2)

Abstract: The assertion of the 'first-ever successful end-to-end model' for mechanical engineering drawings to interlinked information is presented without any quantitative metrics, baselines, error analysis, dataset descriptions, or comparisons, rendering the central effectiveness claim unverifiable from the provided text.
Abstract and method description: The mechanism by which the transformer with an inductive bias for unrestricted graph structures directly outputs variable-sized arbitrary graphs (nodes and edges) without any post-processing step such as thresholding or rule-based assembly is not specified; standard transformer outputs are sequential or fixed, so the end-to-end claim requires an explicit output representation that is not detailed.

minor comments (2)

Abstract: The qualifier 'simplified engineering drawings' is introduced without defining the nature or extent of the simplifications, which is necessary to evaluate whether the unrestricted graph bias generalizes to real-world, unrestricted connectivity.
Abstract: The progressive experiments are described at a high level ('extensive experiments with progressively complex record structures') but lack any reference to specific tables, figures, or quantitative results that would allow assessment of the inductive bias contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity and substantiation of the central claims.

Referee: Abstract: The assertion of the 'first-ever successful end-to-end model' for mechanical engineering drawings to interlinked information is presented without any quantitative metrics, baselines, error analysis, dataset descriptions, or comparisons, rendering the central effectiveness claim unverifiable from the provided text.

Authors: We agree that the abstract, as a high-level summary, does not include these details. The full manuscript reports quantitative metrics, baselines, error analysis, and dataset descriptions for the engineering drawings experiments in the Experiments section. To make the claim more verifiable at a glance, we have added a concise reference to the achieved performance improvements in the revised abstract.

Revision: yes

Referee: Abstract and method description: The mechanism by which the transformer with an inductive bias for unrestricted graph structures directly outputs variable-sized arbitrary graphs (nodes and edges) without any post-processing step such as thresholding or rule-based assembly is not specified; standard transformer outputs are sequential or fixed, so the end-to-end claim requires an explicit output representation that is not detailed.

Authors: The architecture uses an autoregressive transformer decoder that generates a linearized sequence of node and edge tokens according to a fixed schema derived from the graph inductive bias; this sequence is directly parsed into the variable-sized graph without thresholding or rule-based assembly. We have expanded the method section in the revision to explicitly describe this output representation and how the unrestricted graph bias enables it.

Revision: yes
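
The linearize-then-parse idea in this simulated response can be sketched as a round trip between a graph and a flat token sequence. Everything concrete below is an assumption made for illustration: the token vocabulary (`<graph>`, `<node>`, `<edge>`), the nodes-first ordering, the example labels, and the function names `linearize` and `parse` are hypothetical, and the paper's actual schema may differ.

```python
def linearize(nodes, edges):
    """Flatten a graph into one token sequence: all nodes first, then all edges."""
    tokens = ["<graph>"]
    for node_id, label in nodes.items():
        tokens += ["<node>", str(node_id), label]
    for src, dst, relation in edges:
        tokens += ["<edge>", str(src), str(dst), relation]
    tokens.append("</graph>")
    return tokens


def parse(tokens):
    """Rebuild (nodes, edges) in one deterministic pass, with no thresholds or heuristics."""
    if tokens[0] != "<graph>" or tokens[-1] != "</graph>":
        raise ValueError("sequence must be wrapped in <graph> ... </graph>")
    nodes, edges = {}, []
    i = 1
    while i < len(tokens) - 1:
        if tokens[i] == "<node>":
            nodes[int(tokens[i + 1])] = tokens[i + 2]
            i += 3
        elif tokens[i] == "<edge>":
            src, dst = int(tokens[i + 1]), int(tokens[i + 2])
            if src not in nodes or dst not in nodes:
                raise ValueError("edge refers to an undeclared node")
            edges.append((src, dst, tokens[i + 3]))
            i += 4
        else:
            raise ValueError(f"unexpected token {tokens[i]!r}")
    return nodes, edges


# Hypothetical drawing fragment: a dimension that measures a line tangent to a circle.
nodes = {0: "line", 1: "circle", 2: "dimension"}
edges = [(2, 0, "measures"), (0, 1, "tangent_to")]
tokens = linearize(nodes, edges)
round_trip = parse(tokens)
```

Because the sequence length grows with the number of nodes and edges, a decoder that emits such tokens one at a time can in principle produce graphs of any size, and the parser either reconstructs the graph exactly or rejects a malformed sequence.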

Circularity Check

0 steps flagged

No circularity: the inductive bias design and end-to-end transcription claims rest on an independent methodological proposal plus experiments.

The paper frames document recognition as transcription to structured records, proposes a base transformer with structure-specific relational inductive biases, and reports experimental results on monophonic music, shape drawings, and simplified engineering drawings. The central claim of the first successful end-to-end model for unrestricted graph outputs is presented as the outcome of integrating the proposed bias, not as a quantity derived by construction from fitted parameters or prior self-referential definitions. No equations are shown that reduce a prediction to an input fit; no uniqueness theorem or ansatz is imported via self-citation in a load-bearing way; the derivation chain from document structure to bias design to model architecture remains self-contained and externally validated by the progressive experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that documents possess intrinsic structures suitable for relational modeling; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information.
Invoked in the opening of the abstract as the foundation for reframing recognition as transcription.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR
cs.SD · 2026-04 · unverdicted · novelty 5.0

A two-stage OMR pipeline decodes symbol candidates into polyphonic score structures via topology recognition with probability-guided search.
