arxiv: 2507.08458 · v2 · submitted 2025-07-11 · 💻 cs.CV · cs.AI

A document is worth a structured record: Principled inductive bias design for document recognition

Pith reviewed 2026-05-19 04:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords document recognition · inductive bias · transformer architecture · engineering drawings · structured records · end-to-end models · relational biases · transcription task

The pith

Treating document recognition as transcription to structured records allows design of relational inductive biases in transformers that enable end-to-end models for complex document types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard computer vision approaches to document recognition ignore the convention-driven structures that encode precise information in many document types, forcing reliance on heuristic post-processing. Framing the task instead as transcription from document to record naturally groups documents by shared structural properties in their output, so related types can be handled and learned together. The authors propose a method to encode these structures as relational inductive biases inside a base transformer architecture and adapt the same architecture across different record structures. Experiments on monophonic sheet music, shape drawings, and simplified engineering drawings demonstrate the approach, with the key result being the first successful end-to-end transcription of mechanical engineering drawings to inherently interlinked information. A sympathetic reader would care because the method offers a systematic way to handle less frequent or more intricate document types without custom post-processing pipelines.
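
To make the "document to record" framing concrete, here is a minimal Python sketch of three record structures of increasing relational complexity. The class names, fields, and the pairing of each document type with a sequence, a set, or a graph are illustrative assumptions; the summary above states only that the record structures grow progressively more complex and that engineering drawings call for unrestricted graphs.

```python
from dataclasses import dataclass, field

# All names below are hypothetical; none are taken from the paper.

@dataclass(frozen=True)
class Element:
    """One transcribed item, e.g. a note, a shape, or a dimension."""
    kind: str
    value: str

@dataclass
class SequenceRecord:
    """Ordered elements; assumed here to fit monophonic sheet music."""
    elements: list[Element] = field(default_factory=list)

    def relations(self) -> set[tuple[int, int]]:
        # Each element relates only to its successor.
        return {(i, i + 1) for i in range(len(self.elements) - 1)}

@dataclass
class SetRecord:
    """Unordered, unlinked elements; assumed here to fit shape drawings."""
    elements: list[Element] = field(default_factory=list)

    def relations(self) -> set[tuple[int, int]]:
        return set()

@dataclass
class GraphRecord:
    """Elements with arbitrary links; assumed here to fit engineering drawings."""
    elements: list[Element] = field(default_factory=list)
    edges: set[tuple[int, int]] = field(default_factory=set)

    def relations(self) -> set[tuple[int, int]]:
        return set(self.edges)

melody = SequenceRecord(
    [Element("note", "C4"), Element("note", "E4"), Element("note", "G4")]
)
drawing = GraphRecord(
    [Element("line", "edge_a"), Element("line", "edge_b"), Element("dimension", "25 mm")],
    edges={(2, 0), (2, 1)},  # the dimension refers to both lines
)
print(melody.relations())
print(drawing.relations())
```

The three classes differ only in which relations between elements they permit, which is the property a structure-specific bias would have to encode.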

Core claim

The authors establish that integrating an inductive bias for unrestricted graph structures into a base transformer architecture produces the first successful end-to-end model for transcribing mechanical engineering drawings to their inherently interlinked information. This follows from designing structure-specific relational inductive biases that capture the intrinsic, convention-driven properties of each document type, allowing the same architecture to be adapted across record structures while eliminating dependence on heuristic post-processing.

What carries the argument

A base transformer architecture adapted with structure-specific relational inductive biases that encode the convention-driven structures of document types for direct transcription to structured records.

If this is right

  • Documents sharing similar transcription structures can be grouped and learned together within the same adapted architecture.
  • End-to-end recognition becomes feasible for document types whose output records contain complex interlinked information.
  • The same base architecture can be reused across document types by swapping only the relational inductive bias.
  • The design principle offers a template for building future document foundation models without type-specific post-processing.
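
One common way to inject a relational inductive bias into a transformer is to restrict which elements may attend to each other. The sketch below assumes that mechanism; the summary does not say how the paper implements its biases. It shows how swapping the relation set changes only the attention mask while the attention routine itself stays the same. The function names and the toy scores are hypothetical.

```python
import math

def relation_mask(n, relations):
    """Boolean n x n mask: element i may attend to j if i == j or (i, j) is related."""
    allowed = set(relations) | {(i, i) for i in range(n)}
    return [[(i, j) in allowed for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over raw attention scores, ignoring masked-out positions."""
    out = []
    for row, keep in zip(scores, mask):
        # The diagonal is always allowed, so every row has at least one kept score.
        top = max(s for s, k in zip(row, keep) if k)
        exps = [math.exp(s - top) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

n = 3
scores = [[0.0] * n for _ in range(n)]  # uniform toy scores

# Sequence-style bias (assumed): each element sees itself and its predecessor.
chain = {(i, i - 1) for i in range(1, n)}
# Graph-style bias (assumed): arbitrary links, given explicitly.
graph = {(2, 0), (2, 1)}

seq_weights = masked_softmax(scores, relation_mask(n, chain))
graph_weights = masked_softmax(scores, relation_mask(n, graph))
```

Only the relation set passed to the mask changes between the two cases, which mirrors the reuse of one base architecture across record structures.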

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding structural knowledge directly via biases could lower the amount of labeled data needed when adapting to new document types.
  • The same bias-design approach might extend to non-document structured transcription tasks such as scientific diagrams or circuit schematics.
  • Applying the method to noisy real-world scans rather than simplified drawings would test whether the biases remain effective under realistic conditions.

Load-bearing premise

The intrinsic, convention-driven structures of document types can be effectively captured as relational inductive biases inside a transformer architecture.

What would settle it

An experiment in which the adapted transformer for mechanical engineering drawings fails to output accurate interlinked records end-to-end and still requires heuristic post-processing to reach usable accuracy.
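
One way such a test could be scored is to compare predicted and reference records as graphs, without applying any repair step to the prediction. The sketch below computes node and edge F1 plus a strict exact-match flag. The metric choice, the function names, and the toy data are assumptions; the summary does not state which metrics the paper reports, and matching nodes by label is a simplification.

```python
def f1(predicted, reference):
    """F1 between two sets; defined as 1.0 when both are empty."""
    if not predicted and not reference:
        return 1.0
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)

def score_record(pred_nodes, pred_edges, ref_nodes, ref_edges):
    """Score a predicted record against a reference record, with no post-processing."""
    return {
        "node_f1": f1(pred_nodes, ref_nodes),
        "edge_f1": f1(pred_edges, ref_edges),
        "exact": pred_nodes == ref_nodes and pred_edges == ref_edges,
    }

# Hypothetical toy data: two lines and one dimension that refers to both.
ref_nodes = {"line_a", "line_b", "dim_25mm"}
ref_edges = {("dim_25mm", "line_a"), ("dim_25mm", "line_b")}
pred_nodes = {"line_a", "line_b", "dim_25mm"}
pred_edges = {("dim_25mm", "line_a")}  # one link is missing

result = score_record(pred_nodes, pred_edges, ref_nodes, ref_edges)
print(result)
```

Reporting edge scores separately from node scores matters here because the claim under test concerns the interlinked information, not just the detected elements.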

Original abstract

Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, many state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific relational inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe mechanical engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper frames document recognition as transcription from document to structured record, proposing a method to design structure-specific relational inductive biases for transformer models. It introduces a base architecture adaptable to different structures and demonstrates effectiveness through progressive experiments on monophonic sheet music, shape drawings, and simplified engineering drawings. The central claim is that an inductive bias for unrestricted graph structures enables the first successful end-to-end model for transcribing mechanical engineering drawings to inherently interlinked information without heuristic post-processing.

Significance. If the experimental claims are substantiated with metrics and details, this could offer a principled methodology for embedding document-type conventions as inductive biases in end-to-end systems, potentially extending modern recognition techniques to complex or infrequent document types like engineering drawings and informing the design of unified document foundation models.

major comments (2)
  1. Abstract: The assertion of the 'first-ever successful end-to-end model' for mechanical engineering drawings to interlinked information is presented without any quantitative metrics, baselines, error analysis, dataset descriptions, or comparisons, rendering the central effectiveness claim unverifiable from the provided text.
  2. Abstract and method description: The mechanism by which the transformer with an inductive bias for unrestricted graph structures directly outputs variable-sized arbitrary graphs (nodes and edges) without any post-processing step such as thresholding or rule-based assembly is not specified; standard transformer outputs are sequential or fixed, so the end-to-end claim requires an explicit output representation that is not detailed.
minor comments (2)
  1. Abstract: The qualifier 'simplified engineering drawings' is introduced without defining the nature or extent of the simplifications, which is necessary to evaluate whether the unrestricted graph bias generalizes to real-world, unrestricted connectivity.
  2. Abstract: The progressive experiments are described at a high level ('extensive experiments with progressively complex record structures') but lack any reference to specific tables, figures, or quantitative results that would allow assessment of the inductive bias contributions.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity and substantiation of the central claims.

Point-by-point responses
  1. Referee: Abstract: The assertion of the 'first-ever successful end-to-end model' for mechanical engineering drawings to interlinked information is presented without any quantitative metrics, baselines, error analysis, dataset descriptions, or comparisons, rendering the central effectiveness claim unverifiable from the provided text.

    Authors: We agree that the abstract, as a high-level summary, does not include these details. The full manuscript reports quantitative metrics, baselines, error analysis, and dataset descriptions for the engineering drawings experiments in the Experiments section. To make the claim more verifiable at a glance, we have added a concise reference to the achieved performance improvements in the revised abstract. revision: yes

  2. Referee: Abstract and method description: The mechanism by which the transformer with an inductive bias for unrestricted graph structures directly outputs variable-sized arbitrary graphs (nodes and edges) without any post-processing step such as thresholding or rule-based assembly is not specified; standard transformer outputs are sequential or fixed, so the end-to-end claim requires an explicit output representation that is not detailed.

    Authors: The architecture uses an autoregressive transformer decoder that generates a linearized sequence of node and edge tokens according to a fixed schema derived from the graph inductive bias; this sequence is directly parsed into the variable-sized graph without thresholding or rule-based assembly. We have expanded the method section in the revision to explicitly describe this output representation and how the unrestricted graph bias enables it. revision: yes
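
The output representation described in this response can be illustrated with a small round trip between a graph and a token sequence. This is a sketch under assumed conventions: the token names (<node>, <edge>, <eos>), the schema (all nodes first, then edges as index pairs), and both function names are hypothetical and not taken from the paper. It shows only that a fixed schema lets a flat token sequence be parsed deterministically into a graph of any size, with no score thresholding.

```python
def linearize(nodes, edges):
    """Flatten a graph into tokens: nodes first, then edges as index pairs."""
    tokens = []
    for label in nodes:
        tokens += ["<node>", label]
    for src, dst in sorted(edges):
        tokens += ["<edge>", str(src), str(dst)]
    tokens.append("<eos>")
    return tokens

def parse(tokens):
    """Rebuild (nodes, edges) from a token sequence; raise on schema violations."""
    nodes, edges = [], set()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "<eos>":
            return nodes, edges
        if tok == "<node>":
            if edges:
                raise ValueError("node after edge violates the schema")
            nodes.append(tokens[i + 1])
            i += 2
        elif tok == "<edge>":
            src, dst = int(tokens[i + 1]), int(tokens[i + 2])
            if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
                raise ValueError("edge refers to an unknown node")
            edges.add((src, dst))
            i += 3
        else:
            raise ValueError(f"unexpected token {tok!r}")
    raise ValueError("missing <eos>")

# Hypothetical toy drawing: two lines and one dimension linked to both.
nodes = ["line_a", "line_b", "dim_25mm"]
edges = {(2, 0), (2, 1)}

tokens = linearize(nodes, edges)
restored_nodes, restored_edges = parse(tokens)
print(tokens)
```

A trained decoder can still emit a sequence that breaks the schema; the sketch raises an error in that case, and how the paper handles it is not stated in the response.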

Circularity Check

0 steps flagged

No circularity: the inductive bias design and end-to-end transcription claims rest on an independent methodological proposal plus experiments.

full rationale

The paper frames document recognition as transcription to structured records, proposes a base transformer with structure-specific relational inductive biases, and reports experimental results on monophonic music, shape drawings, and simplified engineering drawings. The central claim of the first successful end-to-end model for unrestricted graph outputs is presented as the outcome of integrating the proposed bias, not as a quantity derived by construction from fitted parameters or prior self-referential definitions. No equations are shown that reduce a prediction to an input fit; no uniqueness theorem or ansatz is imported via self-citation in a load-bearing way; the derivation chain from document structure to bias design to model architecture remains self-contained and externally validated by the progressive experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that documents possess intrinsic structures suitable for relational modeling; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information.
    Invoked in the opening of the abstract as the foundation for reframing recognition as transcription.

pith-pipeline@v0.9.0 · 5809 in / 1231 out tokens · 75070 ms · 2026-05-19T04:54:09.556264+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

    cs.SD · 2026-04 · unverdicted · novelty 5.0

    A two-stage OMR pipeline decodes symbol candidates into polyphonic score structures via topology recognition with probability-guided search.
