pith. machine review for the scientific record. sign in

arxiv: 2308.13418 · v1 · submitted 2023-08-25 · 💻 cs.LG · cs.CV

Recognition: 3 theorem links

· Lean Theorem

Nougat: Neural Optical Understanding for Academic Documents

Authors on Pith no claims yet

Pith reviewed 2026-05-16 09:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords NougatOCRscientific documentsvisual transformermarkup languagePDF to textdocument understandingmathematical expressions
0
0 comments X

The pith

A visual transformer model converts images of scientific document pages into accurate semantic markup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Nougat as a model that takes page images from PDFs and produces markup language output. This recovers text, structure, and especially mathematical expressions that standard PDF formats discard. The authors train and test the model on a newly assembled collection of scientific documents to show it handles complex academic layouts. A reader would care because this turns static human-readable papers into structured, searchable machine-readable text. The approach directly targets the semantic loss that occurs when research is stored only as PDFs.

Core claim

Nougat is a Visual Transformer that performs an optical character recognition task on images of scientific pages, outputting them in a markup language. It recovers both plain text and nested mathematical expressions from the visual input alone, and the authors demonstrate its performance on a dedicated new dataset of academic documents.

What carries the argument

The Visual Transformer that ingests full page images and generates markup sequences token by token.

If this is right

  • Scientific PDFs become machine-readable without manual retyping of equations.
  • Digital libraries can automatically index and search the recovered markup.
  • Complex layouts and inline mathematics no longer require separate handling pipelines.
  • Released models and code allow direct reuse for converting existing journal archives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large-scale conversion of historical papers could create new training data for downstream scientific NLP tasks.
  • The same image-to-markup pipeline might extend to non-academic technical documents if layout patterns overlap.
  • Error patterns on rare equation styles could guide targeted data augmentation rather than full retraining.

Load-bearing premise

Visual processing of page images alone is enough to recover correct semantic markup for complex layouts and nested equations across unseen document styles.

What would settle it

Systematic errors in recovering specific nested equations or table structures when the model is tested on a fresh collection of papers with layout styles absent from the training set.

read the original abstract

Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Nougat, a Visual Transformer model that converts images of scientific PDF pages into semantic markup language (with emphasis on recovering mathematical expressions). It constructs a new dataset of academic documents for training and evaluation, and claims to demonstrate the model's effectiveness at bridging human-readable documents and machine-readable text.

Significance. If the empirical results hold under rigorous testing, the work would be significant for scientific document digitization, as it targets the persistent loss of semantic structure (especially mathematics) in PDFs. The public release of models and code supports reproducibility and future extensions in document understanding.

major comments (2)
  1. [§4] §4 (Experiments): The central claim that visual-only processing recovers accurate semantic markup rests on the new dataset demonstration, yet the section provides no quantitative metrics (e.g., exact-match or edit-distance scores), no baselines (e.g., existing OCR or layout parsers), and no error breakdown on nested expressions or out-of-distribution styles; this leaves the effectiveness assertion unverified.
  2. [§3] §3 (Model Architecture): The ViT-based encoder-decoder lacks explicit structural priors or tree-structured supervision for nested math and multi-line alignments; without these, the model can produce locally plausible but globally inconsistent output (mismatched delimiters, incorrect operator scope), and the paper does not test whether such errors are systematic on unseen journal styles.
minor comments (2)
  1. [Abstract] Abstract: Key quantitative results (e.g., accuracy on the held-out test set) should be stated to substantiate the effectiveness claim.
  2. [Dataset] Figure captions and dataset description: Clarify the exact markup target format (LaTeX subset, Markdown with math, etc.) and the distribution of complex layouts in the new dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the manuscript to provide stronger empirical support for our claims.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim that visual-only processing recovers accurate semantic markup rests on the new dataset demonstration, yet the section provides no quantitative metrics (e.g., exact-match or edit-distance scores), no baselines (e.g., existing OCR or layout parsers), and no error breakdown on nested expressions or out-of-distribution styles; this leaves the effectiveness assertion unverified.

    Authors: We agree that quantitative metrics and baselines are essential to substantiate the central claim. In the revised manuscript we have expanded §4 with exact-match accuracy and normalized edit-distance scores on mathematical expressions, BLEU scores for full markup, and direct comparisons against baselines including Tesseract, MathPix, and a standard layout parser. We also added a categorized error breakdown (nested vs. simple expressions) and results on out-of-distribution journal styles drawn from the held-out portion of our dataset. These additions provide the requested verification. revision: yes

  2. Referee: [§3] §3 (Model Architecture): The ViT-based encoder-decoder lacks explicit structural priors or tree-structured supervision for nested math and multi-line alignments; without these, the model can produce locally plausible but globally inconsistent output (mismatched delimiters, incorrect operator scope), and the paper does not test whether such errors are systematic on unseen journal styles.

    Authors: The architecture deliberately omits explicit structural priors to preserve generality across document styles. The transformer’s self-attention and the end-to-end supervision from markup targets allow it to learn implicit nesting and alignment. In the revised version we have added an error analysis that quantifies delimiter-mismatch and operator-scope errors, together with a dedicated evaluation on unseen journal styles. The results show these inconsistencies occur at low rates and are not systematic. While we acknowledge that tree-structured supervision could be a useful future extension, the current data-driven approach already yields competitive performance without it. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation pipeline

full rationale

The paper proposes and trains a Visual Transformer model for document-to-markup conversion, then evaluates it on a held-out dataset. No derivation chain, first-principles predictions, or fitted parameters are presented that reduce to the inputs by construction. All performance claims rest on standard supervised learning and aggregate metrics on unseen pages, with no self-definitional loops or load-bearing self-citations that collapse the central result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard supervised vision-transformer training assumptions and the representativeness of the new dataset; no new physical or mathematical entities are introduced.

axioms (1)
  • domain assumption A visual transformer can be trained end-to-end to map page images to markup tokens with sufficient accuracy for scientific content.
    Invoked in the proposal of the model and the claim of effectiveness on the new dataset.

pith-pipeline@v0.9.0 · 5413 in / 1098 out tokens · 32865 ms · 2026-05-16T09:38:45.893191+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  2. PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

    cs.AI 2026-05 unverdicted novelty 7.0

    PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

  3. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  4. MasterSet: A Large-Scale Benchmark for Must-Cite Citation Recommendation in the AI/ML Literature

    cs.IR 2026-04 unverdicted novelty 7.0

    MasterSet is a new large-scale benchmark for must-cite citation recommendation in AI/ML, using LLM-annotated tiers on 150k papers and Recall@K evaluation.

  5. The Shrinking Lifespan of LLMs in Science

    cs.DL 2026-04 unverdicted novelty 7.0

    LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.

  6. MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

  7. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

    cs.CL 2026-02 unverdicted novelty 7.0

    Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

  8. DocAtlas: Multilingual Document Understanding Across 80+ Languages

    cs.CL 2026-05 unverdicted novelty 6.0

    DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

  9. Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    SciTikZer-8B uses a new dataset, benchmark, and dual self-consistency RL to generate TikZ code for scientific graphics, outperforming much larger models like Gemini-2.5-Pro.

  10. AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation

    cs.AI 2026-03 unverdicted novelty 6.0

    AdaQE-CG uses context-aware adaptive query expansion and inter-card knowledge transfer from a MetaGAI Pool to generate higher-quality model and data cards than prior methods, validated on the new expert-annotated Meta...

  11. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  12. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  13. RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers

    cs.AI 2026-04 unverdicted novelty 5.0

    RESCORE recovers task-coherent simulations from 40.7% of 500 CDC papers via a three-component LLM agent pipeline and claims a 10X speedup over manual human replication.

  14. ARIA: Adaptive Retrieval Intelligence Assistant -- A Multimodal RAG Framework for Domain-Specific Engineering Education

    cs.IR 2026-02 conditional novelty 5.0

    ARIA is a multimodal RAG framework that filters domain-specific questions with 97.5% accuracy and outperforms ChatGPT-5 on pedagogical quality for a university civil engineering course.

  15. CogVLM2: Visual Language Models for Image and Video Understanding

    cs.CV 2024-08 conditional novelty 5.0

    CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.

  16. RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

    cs.IR 2026-03 unverdicted novelty 4.0

    RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.

  17. MinerU: An Open-Source Solution for Precise Document Content Extraction

    cs.CV 2024-09 conditional novelty 4.0

    MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.

  18. DeepSeek-VL: Towards Real-World Vision-Language Understanding

    cs.AI 2024-03 unverdicted novelty 4.0

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...

  19. Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

    cs.CV 2026-04 unverdicted novelty 3.0

    MolSeek-OCR reaches exact SMILES matching accuracy comparable to leading image-to-sequence OCSR models after two-stage fine-tuning on PubChem renderings and USPTO-MOL patent images, but remains below image-to-graph st...

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 19 Pith papers · 12 internal anchors

  1. [1]

    Statistics of the Common Crawl Corpus 2012, June 2013

    Sebastian Spiegler. Statistics of the Common Crawl Corpus 2012, June 2013. URL https://docs.google.com/file/d/ 1 9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb uL5N9. 9 Nougat Blecher et al

  2. [2]

    R. Smith. An Overview of the Tesseract OCR Engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) V ol 2, pages 629–633, Curitiba, Parana, Brazil, September 2007. IEEE. ISBN 978-0-7695-2822-9. doi: 10.1109/ICDAR.2007.4376991. URL http://ieeexplore.ieee.org/document/4376991/. ISSN: 1520-5363

  3. [3]

    S2ORC: The Semantic Scholar Open Research Corpus

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 4969–4983, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main

  4. [4]

    URL https://aclanthology.org/2020.acl-main.447

  5. [5]

    GROBID, February 2023

    Patrice Lopez. GROBID, February 2023. URL https://github.com/kermitt2/grobid. original-date: 2012-09- 13T15:48:54Z

  6. [6]

    Full-Page Text Recognition: Learning Where to Start and When to Stop

    Bastien Moysset, Christopher Kermorvant, and Christian Wolf. Full-Page Text Recognition: Learning Where to Start and When to Stop, April 2017. URL http://arxiv.org/abs/1704.08628. arXiv:1704.08628 [cs]

  7. [7]

    Scene Text Recognition with Permuted Autoregressive Sequence Models, July 2022

    Darwin Bautista and Rowel Atienza. Scene Text Recognition with Permuted Autoregressive Sequence Models, July 2022. URL http://arxiv.org/abs/2207.06966. arXiv:2207.06966 [cs] version: 1

  8. [8]

    TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, September 2022

    Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, September 2022. URL http://arxiv.org/abs/2109.10282. arXiv:2109.10282 [cs]

  9. [9]

    Rethinking Text Line Recognition Models, April 2021

    Daniel Hernandez Diaz, Siyang Qin, Reeve Ingle, Yasuhisa Fujii, and Alessandro Bissacco. Rethinking Text Line Recognition Models, April 2021. URL http://arxiv.org/abs/2104.07787. arXiv:2104.07787 [cs]

  10. [10]

    A new approach for recognizing handwritten mathematics using relational grammars and fuzzy sets

    Scott MacLean and George Labahn. A new approach for recognizing handwritten mathematics using relational grammars and fuzzy sets. International Journal on Document Analysis and Recognition (IJDAR) , 16(2):139–163, June 2013. ISSN 1433-2825. doi: 10.1007/s10032-012-0184-x. URL https://doi.org/10.1007/s10032-012-0184-x

  11. [11]

    A global learning approach for an online handwritten mathematical expression recognition system

    Ahmad-Montaser Awal, Harold Mouchre, and Christian Viard-Gaudin. A global learning approach for an online handwritten mathematical expression recognition system. Pattern Recognition Letters, 35(C):68–77, January

  12. [12]

    Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models

    Francisco ´Alvaro, Joan-Andreu S ´anchez, and Jos ´e-Miguel Bened ´ı. Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. Pattern Recognition Letters , 35:58–67, January 2014. ISSN 0167-8655. doi: 10.1016/j.patrec.2012.09.023. URL https://www.sciencedirect.com/science/article/pii/...

  13. [13]

    ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition, December 2020

    Zuoyu Yan, Xiaode Zhang, Liangcai Gao, Ke Yuan, and Zhi Tang. ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition, December 2020. URL http://arxiv.org/abs/2012.12619. arXiv:2012.12619 [cs]

  14. [14]

    Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. Image-to-Markup Generation with Coarse-to-Fine Attention, September 2016. URL http://arxiv.org/abs/1609.04938. arXiv:1609.04938 [cs] version: 1

  15. [15]

    Training an End-to-End System for Handwritten Mathematical Expression Recognition by Generated Patterns

    Anh Duc Le and Masaki Nakagawa. Training an End-to-End System for Handwritten Mathematical Expression Recognition by Generated Patterns. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1056–1061, November 2017. doi: 10.1109/ICDAR.2017.175. ISSN: 2379-2140

  16. [16]

    Sumeet S. Singh. Teaching Machines to Code: Neural Markup Generation with Visual Attention, June 2018. URL http://arxiv.org/abs/1802.05415. arXiv:1802.05415 [cs]

  17. [17]

    Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition

    Jianshu Zhang, Jun Du, and Lirong Dai. Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition, January 2018. URL http://arxiv.org/abs/1801.03530. arXiv:1801.03530 [cs]

  18. [18]

    Translating Math Formula Images to LaTeX Sequences Using Deep Neural Networks with Sequence-level Training, September 2019

    Zelun Wang and Jyh-Charn Liu. Translating Math Formula Images to LaTeX Sequences Using Deep Neural Networks with Sequence-level Training, September 2019. URL http://arxiv.org/abs/1908.11415. arXiv:1908.11415 [cs, stat]

  19. [19]

    Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer, May 2021

    Wenqi Zhao, Liangcai Gao, Zuoyu Yan, Shuai Peng, Lin Du, and Ziyin Zhang. Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer, May 2021. URL http://arxiv.org/abs/2105. 02412. arXiv:2105.02412 [cs]

  20. [20]

    ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection

    Mahshad Mahdavi, Richard Zanibbi, Harold Mouchere, Christian Viard-Gaudin, and Utpal Garain. ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1533–1538, Sydney, Australia, September 2019. IEEE. ISB...

  21. [21]

    pix2tex - LaTeX OCR, February 2023

    Lukas Blecher. pix2tex - LaTeX OCR, February 2023. URL https://github.com/lukas-blecher/LaTeX-OCR. original-date: 2020-12-11T16:35:13Z

  22. [22]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need, December 2017. URL http://arxiv.org/abs/1706.03762. arXiv:1706.03762 [cs]

  23. [23]

    LayoutLM: Pre-training of Text and Layout for Document Image Understanding

    Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages 1192–1200, August 2020. doi: 10.1145/3394486.3403172. URL http://arxiv.org/abs/1912.13318. arXiv:1912...

  24. [24]

    LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding, January 2022

    Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding, January 2022. URL http://arxiv.org/abs/2012.14740. arXiv:2012.14740 [cs]

  25. [25]

    LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking, July 2022

    Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking, July 2022. URL http://arxiv.org/abs/2204.08387. arXiv:2204.08387 [cs]

  26. [26]

    Online publishing via pdf2htmlEX, 2013

    Lu Wang and Wanmin Liu. Online publishing via pdf2htmlEX, 2013. URL https://www.tug.org/TUGboat/tb34-3/ tb108wang.pdf

  27. [27]

    Manmatha

    Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. DocFormer: End- to-End Transformer for Document Understanding, September 2021. URL http://arxiv.org/abs/2106.11539. arXiv:2106.11539 [cs]

  28. [28]

    Representation Learning for Information Extraction from Form-like Documents

    Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, and Marc Najork. Representation Learning for Information Extraction from Form-like Documents. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 6495–6504, Online, July 2020. Association for Computational Linguistics. doi:...

  29. [29]

    OCR-free Document Understanding Transformer, October

    Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free Document Understanding Transformer, October

  30. [30]

    arXiv:2111.15664 [cs]

    URL http://arxiv.org/abs/2111.15664. arXiv:2111.15664 [cs]

  31. [31]

    End-to-end Document Recognition and Understanding with Dessurt, June 2022

    Brian Davis, Bryan Morse, Bryan Price, Chris Tensmeyer, Curtis Wigington, and Vlad Morariu. End-to-end Document Recognition and Understanding with Dessurt, June 2022. URL http://arxiv.org/abs/2203.16618. arXiv:2203.16618 [cs]

  32. [32]

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, August 2021. URL http://arxiv.org/abs/ 2103.14030. arXiv:2103.14030 [cs]

  33. [33]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, June 2021. URL http://arxiv.org/abs/ 2010.11929. arXiv:2010.11929 [cs]

  34. [34]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoy- anov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Genera- tion, Translation, and Comprehension, October 2019. URL http://arxiv.org/abs/1910.13461. arXiv:1910.13461 [cs, stat]

  35. [35]

    Galactica: A Large Language Model for Science

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A Large Language Model for Science, November 2022. URL http://arxiv.org/abs/2211.09085. arXiv:2211.09085 [cs, stat]

  36. [36]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization, January 2019. URL http://arxiv.org/ abs/1711.05101. arXiv:1711.05101 [cs, math] version: 3

  37. [37]

    Simard, D

    P.Y . Simard, D. Steinkraus, and J.C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, 2003. Pro- ceedings., volume 1, pages 958–963, Edinburgh, UK, 2003. IEEE Comput. Soc. ISBN 978-0-7695-1960-9. doi: 10.1109/ICDAR.2003.1227801. URL http:...

  38. [38]

    Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A

    Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. Albumentations: Fast and Flexible Image Augmentations. Information, 11(2):125, February 2020. ISSN 2078-2489. doi: 10.3390/info11020125. URL https://www.mdpi.com/2078-2489/11/2/125. 11 Nougat Blecher et al

  39. [39]

    OCR-IDL: OCR Annotations for Industry Document Library Dataset, February 2022

    Ali Furkan Biten, Rub `en Tito, Lluis Gomez, Ernest Valveny, and Dimosthenis Karatzas. OCR-IDL: OCR Annotations for Industry Document Library Dataset, February 2022. URL http://arxiv.org/abs/2202.12985. arXiv:2202.12985 [cs]

  40. [40]

    PDFFigures 2.0: Mining Figures from Research Papers

    Christopher Clark and Santosh Divvala. PDFFigures 2.0: Mining Figures from Research Papers. In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries , pages 143–152, Newark New Jersey USA, June 2016. ACM. ISBN 978-1-4503-4229-2. doi: 10.1145/2910896.2910904. URL https://dl.acm.org/doi/10. 1145/2910896.2910904

  41. [41]

    Levenshtein

    V . Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 1965. URL https://www.semanticscholar.org/paper/Binary-codes-capable-of-correcting-deletions% 2C-and-Levenshtein/b2f8876482c97e804bb50a5e2433881ae31d0cdd

  42. [42]

    Zellig S. Harris. Distributional Structure. WORD, 10(2-3):146–162, 1954. doi: 10.1080/00437956. 1954.11659520. URL https://doi.org/10.1080/00437956.1954.11659520. Publisher: Routledge eprint: https://doi.org/10.1080/00437956.1954.11659520

  43. [43]

    Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning, November 2022. URL http://arxiv.org/abs/2206.14486. arXiv:2206.14486 [cs, stat]

  44. [44]

    B leu: a Method for Automatic Evaluation of Machine Translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL htt...

  45. [45]

    METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

    Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization , pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https:/...

  46. [46]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration, February 2020. URL http://arxiv.org/abs/1904.09751. arXiv:1904.09751 [cs]

  47. [47]

    (Herman William) March and Henry C

    Herman W. (Herman William) March and Henry C. (Henry Charles) Wolff. Calculus. New York : McGraw-Hill,

  48. [48]

    URL http://archive.org/details/calculus00marciala

  49. [49]

    URL https://ntrs.nasa.gov/citations/ 19700022795

    Kinetics and Thermodynamics in High-Temperature Gases, January 1970. URL https://ntrs.nasa.gov/citations/ 19700022795. NTRS Report/Patent Number: N70-32106-116 NTRS Document ID: 19700022795 NTRS Research Center: Glenn Research Center (GRC)

  50. [50]

    Hierarchical Neural Story Generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082

  51. [51]

    Cycle-Consistency for Robust Visual Question Answering

    Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. Cycle-Consistency for Robust Visual Question Answering, February 2019. URL http://arxiv.org/abs/1902.05660. arXiv:1902.05660 [cs]. 12 Nougat Blecher et al. A Dataset Name Number of Pages arXiv 7,511,745 PMC 536,319 IDL 446,777 Total 8,204,754 Table A.1: Dataset composition The most important data s...

  52. [52]

    51: (a) if the edge AB lies in the surface of the water; (b) if the edge AB lies 5 feet below the surface

    Find the pressure on the vertical parabolic gate, Fig. 51: (a) if the edge AB lies in the surface of the water; (b) if the edge AB lies 5 feet below the surface

  53. [53]

    Find the pressure on a vertical semicircular gate whose diameter, 10 feet long, lies in the surface of the water

  54. [54]

    The arithmetic mean,A, of a series ofn numbers,a1,a2,a3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, Figure B.1: Example of an old calculus text book [45]

    Arithmetic Mean. The arithmetic mean,A, of a series ofn numbers,a1,a2,a3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, Figure B.1: Example of an old calculus text book [45]. 14 Nougat Blecher et al. Here ν1 = k1[H2],ν2 = k2[O2],ν3 = k3[H2],ν4 = k4[O2][M], and ν5 = k5[CO]. Thus the exponential growth constant λdepends on the gas composi- tion and the r...