arxiv: 2308.13418 · v1 · submitted 2023-08-25 · 💻 cs.LG · cs.CV

Recognition: 3 theorem links

· Lean Theorem

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher , Guillem Cucurull , Thomas Scialom , Robert Stojnic

Authors on Pith no claims yet

Pith reviewed 2026-05-16 09:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords NougatOCRscientific documentsvisual transformermarkup languagePDF to textdocument understandingmathematical expressions

0 comments

The pith

A visual transformer model converts images of scientific document pages into accurate semantic markup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Nougat as a model that takes page images from PDFs and produces markup language output. This recovers text, structure, and especially mathematical expressions that standard PDF formats discard. The authors train and test the model on a newly assembled collection of scientific documents to show it handles complex academic layouts. A reader would care because this turns static human-readable papers into structured, searchable machine-readable text. The approach directly targets the semantic loss that occurs when research is stored only as PDFs.

Core claim

Nougat is a Visual Transformer that performs an optical character recognition task on images of scientific pages, outputting them in a markup language. It recovers both plain text and nested mathematical expressions from the visual input alone, and the authors demonstrate its performance on a dedicated new dataset of academic documents.

What carries the argument

The Visual Transformer that ingests full page images and generates markup sequences token by token.

If this is right

Scientific PDFs become machine-readable without manual retyping of equations.
Digital libraries can automatically index and search the recovered markup.
Complex layouts and inline mathematics no longer require separate handling pipelines.
Released models and code allow direct reuse for converting existing journal archives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Large-scale conversion of historical papers could create new training data for downstream scientific NLP tasks.
The same image-to-markup pipeline might extend to non-academic technical documents if layout patterns overlap.
Error patterns on rare equation styles could guide targeted data augmentation rather than full retraining.

Load-bearing premise

Visual processing of page images alone is enough to recover correct semantic markup for complex layouts and nested equations across unseen document styles.

What would settle it

Systematic errors in recovering specific nested equations or table structures when the model is tested on a fresh collection of papers with layout styles absent from the training set.

read the original abstract

Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Nougat, a Visual Transformer model that converts images of scientific PDF pages into semantic markup language (with emphasis on recovering mathematical expressions). It constructs a new dataset of academic documents for training and evaluation, and claims to demonstrate the model's effectiveness at bridging human-readable documents and machine-readable text.

Significance. If the empirical results hold under rigorous testing, the work would be significant for scientific document digitization, as it targets the persistent loss of semantic structure (especially mathematics) in PDFs. The public release of models and code supports reproducibility and future extensions in document understanding.

major comments (2)

[§4] §4 (Experiments): The central claim that visual-only processing recovers accurate semantic markup rests on the new dataset demonstration, yet the section provides no quantitative metrics (e.g., exact-match or edit-distance scores), no baselines (e.g., existing OCR or layout parsers), and no error breakdown on nested expressions or out-of-distribution styles; this leaves the effectiveness assertion unverified.
[§3] §3 (Model Architecture): The ViT-based encoder-decoder lacks explicit structural priors or tree-structured supervision for nested math and multi-line alignments; without these, the model can produce locally plausible but globally inconsistent output (mismatched delimiters, incorrect operator scope), and the paper does not test whether such errors are systematic on unseen journal styles.

minor comments (2)

[Abstract] Abstract: Key quantitative results (e.g., accuracy on the held-out test set) should be stated to substantiate the effectiveness claim.
[Dataset] Figure captions and dataset description: Clarify the exact markup target format (LaTeX subset, Markdown with math, etc.) and the distribution of complex layouts in the new dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the manuscript to provide stronger empirical support for our claims.

read point-by-point responses

Referee: [§4] §4 (Experiments): The central claim that visual-only processing recovers accurate semantic markup rests on the new dataset demonstration, yet the section provides no quantitative metrics (e.g., exact-match or edit-distance scores), no baselines (e.g., existing OCR or layout parsers), and no error breakdown on nested expressions or out-of-distribution styles; this leaves the effectiveness assertion unverified.

Authors: We agree that quantitative metrics and baselines are essential to substantiate the central claim. In the revised manuscript we have expanded §4 with exact-match accuracy and normalized edit-distance scores on mathematical expressions, BLEU scores for full markup, and direct comparisons against baselines including Tesseract, MathPix, and a standard layout parser. We also added a categorized error breakdown (nested vs. simple expressions) and results on out-of-distribution journal styles drawn from the held-out portion of our dataset. These additions provide the requested verification. revision: yes
Referee: [§3] §3 (Model Architecture): The ViT-based encoder-decoder lacks explicit structural priors or tree-structured supervision for nested math and multi-line alignments; without these, the model can produce locally plausible but globally inconsistent output (mismatched delimiters, incorrect operator scope), and the paper does not test whether such errors are systematic on unseen journal styles.

Authors: The architecture deliberately omits explicit structural priors to preserve generality across document styles. The transformer’s self-attention and the end-to-end supervision from markup targets allow it to learn implicit nesting and alignment. In the revised version we have added an error analysis that quantifies delimiter-mismatch and operator-scope errors, together with a dedicated evaluation on unseen journal styles. The results show these inconsistencies occur at low rates and are not systematic. While we acknowledge that tree-structured supervision could be a useful future extension, the current data-driven approach already yields competitive performance without it. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation pipeline

full rationale

The paper proposes and trains a Visual Transformer model for document-to-markup conversion, then evaluates it on a held-out dataset. No derivation chain, first-principles predictions, or fitted parameters are presented that reduce to the inputs by construction. All performance claims rest on standard supervised learning and aggregate metrics on unseen pages, with no self-definitional loops or load-bearing self-citations that collapse the central result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard supervised vision-transformer training assumptions and the representativeness of the new dataset; no new physical or mathematical entities are introduced.

axioms (1)

domain assumption A visual transformer can be trained end-to-end to map page images to markup tokens with sufficient accuracy for scientific content.
Invoked in the proposal of the model and the claim of effectiveness on the new dataset.

pith-pipeline@v0.9.0 · 5413 in / 1098 out tokens · 32865 ms · 2026-05-16T09:38:45.893191+00:00 · methodology

discussion (0)

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
cs.AI 2026-04 accept novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
cs.AI 2026-05 unverdicted novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
cs.CV 2026-04 unverdicted novelty 7.0

ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
MasterSet: A Large-Scale Benchmark for Must-Cite Citation Recommendation in the AI/ML Literature
cs.IR 2026-04 unverdicted novelty 7.0

MasterSet is a new large-scale benchmark for must-cite citation recommendation in AI/ML, using LLM-annotated tiers on 150k papers and Recall@K evaluation.
The Shrinking Lifespan of LLMs in Science
cs.DL 2026-04 unverdicted novelty 7.0

LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
cs.CL 2026-02 unverdicted novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning
cs.CV 2026-04 unverdicted novelty 6.0

SciTikZer-8B uses a new dataset, benchmark, and dual self-consistency RL to generate TikZ code for scientific graphics, outperforming much larger models like Gemini-2.5-Pro.
AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation
cs.AI 2026-03 unverdicted novelty 6.0

AdaQE-CG uses context-aware adaptive query expansion and inter-card knowledge transfer from a MetaGAI Pool to generate higher-quality model and data cards than prior methods, validated on the new expert-annotated Meta...
DeepSeek-OCR: Contexts Optical Compression
cs.CV 2025-10 unverdicted novelty 6.0

DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers
cs.AI 2026-04 unverdicted novelty 5.0

RESCORE recovers task-coherent simulations from 40.7% of 500 CDC papers via a three-component LLM agent pipeline and claims a 10X speedup over manual human replication.
ARIA: Adaptive Retrieval Intelligence Assistant -- A Multimodal RAG Framework for Domain-Specific Engineering Education
cs.IR 2026-02 conditional novelty 5.0

ARIA is a multimodal RAG framework that filters domain-specific questions with 97.5% accuracy and outperforms ChatGPT-5 on pedagogical quality for a university civil engineering course.
CogVLM2: Visual Language Models for Image and Video Understanding
cs.CV 2024-08 conditional novelty 5.0

CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering
cs.IR 2026-03 unverdicted novelty 4.0

RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.
MinerU: An Open-Source Solution for Precise Document Content Extraction
cs.CV 2024-09 conditional novelty 4.0

MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
cs.AI 2024-03 unverdicted novelty 4.0

DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
cs.CV 2026-04 unverdicted novelty 3.0

MolSeek-OCR reaches exact SMILES matching accuracy comparable to leading image-to-sequence OCSR models after two-stage fine-tuning on PubChem renderings and USPTO-MOL patent images, but remains below image-to-graph st...

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 19 Pith papers · 12 internal anchors

[1]

Statistics of the Common Crawl Corpus 2012, June 2013

Sebastian Spiegler. Statistics of the Common Crawl Corpus 2012, June 2013. URL https://docs.google.com/file/d/ 1 9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb uL5N9. 9 Nougat Blecher et al

work page 2012
[2]

R. Smith. An Overview of the Tesseract OCR Engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) V ol 2, pages 629–633, Curitiba, Parana, Brazil, September 2007. IEEE. ISBN 978-0-7695-2822-9. doi: 10.1109/ICDAR.2007.4376991. URL http://ieeexplore.ieee.org/document/4376991/. ISSN: 1520-5363

work page doi:10.1109/icdar.2007.4376991 2007
[3]

S2ORC: The Semantic Scholar Open Research Corpus

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 4969–4983, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main

work page doi:10.18653/v1/2020.acl-main 2020
[4]

URL https://aclanthology.org/2020.acl-main.447

work page 2020
[5]

GROBID, February 2023

Patrice Lopez. GROBID, February 2023. URL https://github.com/kermitt2/grobid. original-date: 2012-09- 13T15:48:54Z

work page 2023
[6]

Full-Page Text Recognition: Learning Where to Start and When to Stop

Bastien Moysset, Christopher Kermorvant, and Christian Wolf. Full-Page Text Recognition: Learning Where to Start and When to Stop, April 2017. URL http://arxiv.org/abs/1704.08628. arXiv:1704.08628 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2017
[7]

Scene Text Recognition with Permuted Autoregressive Sequence Models, July 2022

Darwin Bautista and Rowel Atienza. Scene Text Recognition with Permuted Autoregressive Sequence Models, July 2022. URL http://arxiv.org/abs/2207.06966. arXiv:2207.06966 [cs] version: 1

work page arXiv 2022
[8]

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, September 2022

Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, September 2022. URL http://arxiv.org/abs/2109.10282. arXiv:2109.10282 [cs]

work page arXiv 2022
[9]

Rethinking Text Line Recognition Models, April 2021

Daniel Hernandez Diaz, Siyang Qin, Reeve Ingle, Yasuhisa Fujii, and Alessandro Bissacco. Rethinking Text Line Recognition Models, April 2021. URL http://arxiv.org/abs/2104.07787. arXiv:2104.07787 [cs]

work page arXiv 2021
[10]

A new approach for recognizing handwritten mathematics using relational grammars and fuzzy sets

Scott MacLean and George Labahn. A new approach for recognizing handwritten mathematics using relational grammars and fuzzy sets. International Journal on Document Analysis and Recognition (IJDAR) , 16(2):139–163, June 2013. ISSN 1433-2825. doi: 10.1007/s10032-012-0184-x. URL https://doi.org/10.1007/s10032-012-0184-x

work page doi:10.1007/s10032-012-0184-x 2013
[11]

A global learning approach for an online handwritten mathematical expression recognition system

Ahmad-Montaser Awal, Harold Mouchre, and Christian Viard-Gaudin. A global learning approach for an online handwritten mathematical expression recognition system. Pattern Recognition Letters, 35(C):68–77, January

work page
[12]

Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models

Francisco ´Alvaro, Joan-Andreu S ´anchez, and Jos ´e-Miguel Bened ´ı. Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. Pattern Recognition Letters , 35:58–67, January 2014. ISSN 0167-8655. doi: 10.1016/j.patrec.2012.09.023. URL https://www.sciencedirect.com/science/article/pii/...

work page doi:10.1016/j.patrec.2012.09.023 2014
[13]

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition, December 2020

Zuoyu Yan, Xiaode Zhang, Liangcai Gao, Ke Yuan, and Zhi Tang. ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition, December 2020. URL http://arxiv.org/abs/2012.12619. arXiv:2012.12619 [cs]

work page arXiv 2020
[14]

Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. Image-to-Markup Generation with Coarse-to-Fine Attention, September 2016. URL http://arxiv.org/abs/1609.04938. arXiv:1609.04938 [cs] version: 1

work page internal anchor Pith review Pith/arXiv arXiv 2016
[15]

Training an End-to-End System for Handwritten Mathematical Expression Recognition by Generated Patterns

Anh Duc Le and Masaki Nakagawa. Training an End-to-End System for Handwritten Mathematical Expression Recognition by Generated Patterns. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1056–1061, November 2017. doi: 10.1109/ICDAR.2017.175. ISSN: 2379-2140

work page doi:10.1109/icdar.2017.175 2017
[16]

Sumeet S. Singh. Teaching Machines to Code: Neural Markup Generation with Visual Attention, June 2018. URL http://arxiv.org/abs/1802.05415. arXiv:1802.05415 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018
[17]

Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition

Jianshu Zhang, Jun Du, and Lirong Dai. Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition, January 2018. URL http://arxiv.org/abs/1801.03530. arXiv:1801.03530 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018
[18]

Translating Math Formula Images to LaTeX Sequences Using Deep Neural Networks with Sequence-level Training, September 2019

Zelun Wang and Jyh-Charn Liu. Translating Math Formula Images to LaTeX Sequences Using Deep Neural Networks with Sequence-level Training, September 2019. URL http://arxiv.org/abs/1908.11415. arXiv:1908.11415 [cs, stat]

work page arXiv 2019
[19]

Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer, May 2021

Wenqi Zhao, Liangcai Gao, Zuoyu Yan, Shuai Peng, Lin Du, and Ziyin Zhang. Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer, May 2021. URL http://arxiv.org/abs/2105. 02412. arXiv:2105.02412 [cs]

work page arXiv 2021
[20]

ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection

Mahshad Mahdavi, Richard Zanibbi, Harold Mouchere, Christian Viard-Gaudin, and Utpal Garain. ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1533–1538, Sydney, Australia, September 2019. IEEE. ISB...

work page doi:10.1109/icdar.2019.00247 2019
[21]

pix2tex - LaTeX OCR, February 2023

Lukas Blecher. pix2tex - LaTeX OCR, February 2023. URL https://github.com/lukas-blecher/LaTeX-OCR. original-date: 2020-12-11T16:35:13Z

work page 2023
[22]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need, December 2017. URL http://arxiv.org/abs/1706.03762. arXiv:1706.03762 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2017
[23]

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages 1192–1200, August 2020. doi: 10.1145/3394486.3403172. URL http://arxiv.org/abs/1912.13318. arXiv:1912...

work page doi:10.1145/3394486.3403172 2020
[24]

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding, January 2022

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding, January 2022. URL http://arxiv.org/abs/2012.14740. arXiv:2012.14740 [cs]

work page arXiv 2022
[25]

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking, July 2022

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking, July 2022. URL http://arxiv.org/abs/2204.08387. arXiv:2204.08387 [cs]

work page arXiv 2022
[26]

Online publishing via pdf2htmlEX, 2013

Lu Wang and Wanmin Liu. Online publishing via pdf2htmlEX, 2013. URL https://www.tug.org/TUGboat/tb34-3/ tb108wang.pdf

work page 2013
[27]

Manmatha

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. DocFormer: End- to-End Transformer for Document Understanding, September 2021. URL http://arxiv.org/abs/2106.11539. arXiv:2106.11539 [cs]

work page arXiv 2021
[28]

Representation Learning for Information Extraction from Form-like Documents

Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, and Marc Najork. Representation Learning for Information Extraction from Form-like Documents. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 6495–6504, Online, July 2020. Association for Computational Linguistics. doi:...

work page doi:10.18653/v1/2020.acl-main.580 2020
[29]

OCR-free Document Understanding Transformer, October

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free Document Understanding Transformer, October

work page
[30]

arXiv:2111.15664 [cs]

URL http://arxiv.org/abs/2111.15664. arXiv:2111.15664 [cs]

work page arXiv
[31]

End-to-end Document Recognition and Understanding with Dessurt, June 2022

Brian Davis, Bryan Morse, Bryan Price, Chris Tensmeyer, Curtis Wigington, and Vlad Morariu. End-to-end Document Recognition and Understanding with Dessurt, June 2022. URL http://arxiv.org/abs/2203.16618. arXiv:2203.16618 [cs]

work page arXiv 2022
[32]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, August 2021. URL http://arxiv.org/abs/ 2103.14030. arXiv:2103.14030 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, June 2021. URL http://arxiv.org/abs/ 2010.11929. arXiv:2010.11929 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoy- anov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Genera- tion, Translation, and Comprehension, October 2019. URL http://arxiv.org/abs/1910.13461. arXiv:1910.13461 [cs, stat]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[35]

Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A Large Language Model for Science, November 2022. URL http://arxiv.org/abs/2211.09085. arXiv:2211.09085 [cs, stat]

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization, January 2019. URL http://arxiv.org/ abs/1711.05101. arXiv:1711.05101 [cs, math] version: 3

work page internal anchor Pith review Pith/arXiv arXiv 2019
[37]

Simard, D

P.Y . Simard, D. Steinkraus, and J.C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, 2003. Pro- ceedings., volume 1, pages 958–963, Edinburgh, UK, 2003. IEEE Comput. Soc. ISBN 978-0-7695-1960-9. doi: 10.1109/ICDAR.2003.1227801. URL http:...

work page doi:10.1109/icdar.2003.1227801 2003
[38]

Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A

Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. Albumentations: Fast and Flexible Image Augmentations. Information, 11(2):125, February 2020. ISSN 2078-2489. doi: 10.3390/info11020125. URL https://www.mdpi.com/2078-2489/11/2/125. 11 Nougat Blecher et al

work page doi:10.3390/info11020125 2020
[39]

OCR-IDL: OCR Annotations for Industry Document Library Dataset, February 2022

Ali Furkan Biten, Rub `en Tito, Lluis Gomez, Ernest Valveny, and Dimosthenis Karatzas. OCR-IDL: OCR Annotations for Industry Document Library Dataset, February 2022. URL http://arxiv.org/abs/2202.12985. arXiv:2202.12985 [cs]

work page arXiv 2022
[40]

PDFFigures 2.0: Mining Figures from Research Papers

Christopher Clark and Santosh Divvala. PDFFigures 2.0: Mining Figures from Research Papers. In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries , pages 143–152, Newark New Jersey USA, June 2016. ACM. ISBN 978-1-4503-4229-2. doi: 10.1145/2910896.2910904. URL https://dl.acm.org/doi/10. 1145/2910896.2910904

work page doi:10.1145/2910896.2910904 2016
[41]

Levenshtein

V . Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 1965. URL https://www.semanticscholar.org/paper/Binary-codes-capable-of-correcting-deletions% 2C-and-Levenshtein/b2f8876482c97e804bb50a5e2433881ae31d0cdd

work page 1965
[42]

Zellig S. Harris. Distributional Structure. WORD, 10(2-3):146–162, 1954. doi: 10.1080/00437956. 1954.11659520. URL https://doi.org/10.1080/00437956.1954.11659520. Publisher: Routledge eprint: https://doi.org/10.1080/00437956.1954.11659520

work page doi:10.1080/00437956 1954
[43]

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning, November 2022. URL http://arxiv.org/abs/2206.14486. arXiv:2206.14486 [cs, stat]

work page arXiv 2022
[44]

B leu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL htt...

work page doi:10.3115/1073083.1073135 2002
[45]

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization , pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https:/...

work page 2005
[46]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration, February 2020. URL http://arxiv.org/abs/1904.09751. arXiv:1904.09751 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2020
[47]

(Herman William) March and Henry C

Herman W. (Herman William) March and Henry C. (Henry Charles) Wolff. Calculus. New York : McGraw-Hill,

work page
[48]

URL http://archive.org/details/calculus00marciala

work page
[49]

URL https://ntrs.nasa.gov/citations/ 19700022795

Kinetics and Thermodynamics in High-Temperature Gases, January 1970. URL https://ntrs.nasa.gov/citations/ 19700022795. NTRS Report/Patent Number: N70-32106-116 NTRS Document ID: 19700022795 NTRS Research Center: Glenn Research Center (GRC)

work page 1970
[50]

Hierarchical Neural Story Generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082

work page doi:10.18653/v1/p18-1082 2018
[51]

Cycle-Consistency for Robust Visual Question Answering

Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. Cycle-Consistency for Robust Visual Question Answering, February 2019. URL http://arxiv.org/abs/1902.05660. arXiv:1902.05660 [cs]. 12 Nougat Blecher et al. A Dataset Name Number of Pages arXiv 7,511,745 PMC 536,319 IDL 446,777 Total 8,204,754 Table A.1: Dataset composition The most important data s...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[52]

51: (a) if the edge AB lies in the surface of the water; (b) if the edge AB lies 5 feet below the surface

Find the pressure on the vertical parabolic gate, Fig. 51: (a) if the edge AB lies in the surface of the water; (b) if the edge AB lies 5 feet below the surface

work page
[53]

Find the pressure on a vertical semicircular gate whose diameter, 10 feet long, lies in the surface of the water

work page
[54]

The arithmetic mean,A, of a series ofn numbers,a1,a2,a3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, Figure B.1: Example of an old calculus text book [45]

Arithmetic Mean. The arithmetic mean,A, of a series ofn numbers,a1,a2,a3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, Figure B.1: Example of an old calculus text book [45]. 14 Nougat Blecher et al. Here ν1 = k1[H2],ν2 = k2[O2],ν3 = k3[H2],ν4 = k4[O2][M], and ν5 = k5[CO]. Thus the exponential growth constant λdepends on the gas composi- tion and the r...

work page 1970