Reading order detection in a document

Lei Cui (Beijing), Yiheng Xu (Beijing), Yang Xu (Harbin), Furu Wei (Beijing), Zilong Wang (Shanghai)

Pith reviewed 2026-05-06 03:36 UTC · model claude-opus-4-7

classification patents

keywords reading order detection, document layout analysis, document AI, layout-aware embeddings, OCR post-processing, text and layout pre-training, bounding box embedding, document understanding

The pith

A neural method orders the text of a scanned document by fusing what each token says with where it sits on the page.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reading order — the linear sequence in which a human would read the text blocks on a page — is non-trivial for documents with columns, sidebars, tables, captions, and forms. The patent describes a system that treats reading-order detection as a learned task over both content and geometry. Each text element produces a text embedding and a layout embedding derived from its bounding box; these are concatenated and passed through a feature-extraction network to yield semantic representations that encode position-aware meaning. A downstream component consumes these representations to produce an ordering. The training signal is the known reading order of sample documents, allowing the two networks to be optimized jointly. The promise to a sympathetic reader: downstream document AI — extraction, summarization, question answering, OCR post-processing — depends on getting the sequence right before any of it works, and a layout-aware learner should beat the heuristic top-to-bottom, left-to-right rules that fail on real-world layouts.

Core claim

The patent claims a method for determining the reading order of text in a document by jointly using two signals: the text itself and the spatial layout (bounding-box positions) of each text element on the page. A neural feature extractor converts each text element and its layout coordinates into embeddings, concatenates them, and produces a per-element semantic representation. A second stage uses these representations to output an ordering of the text elements. Training is supervised by documents whose ground-truth reading order is known, learning both the feature extractor and the order-determination network end to end.
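The fusion step described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: `text_embedding` is a hypothetical stand-in for a learned lookup table, the layout embedding is assumed to be the bounding box normalized by page size, and a single dense layer with random weights stands in for the feature-extraction network. The embedding sizes and page dimensions are arbitrary.

```python
import math
import random

def layout_embedding(bbox, page_width, page_height):
    """Normalize (x0, y0, x1, y1) to [0, 1] so the vector is page-size independent."""
    x0, y0, x1, y1 = bbox
    return [x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height]

def text_embedding(token, dim=4):
    """Hypothetical stand-in for a learned embedding table:
    a reproducible pseudo-random vector per token."""
    rng = random.Random(token)  # string seed gives the same vector on every run
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def feature_extractor(fused, weights):
    """One dense layer with tanh, standing in for the feature-extraction network."""
    return [math.tanh(sum(w * x for w, x in zip(row, fused))) for row in weights]

def semantic_representations(elements, page_width, page_height, weights):
    """Return one representation per (token, bbox) text element."""
    reps = []
    for token, bbox in elements:
        # Concatenate the content signal with the geometry signal.
        fused = text_embedding(token) + layout_embedding(bbox, page_width, page_height)
        reps.append(feature_extractor(fused, weights))
    return reps

# Toy setup: 8 fused inputs (4 text + 4 layout) mapped to 3 output features.
_rng = random.Random(0)
WEIGHTS = [[_rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(3)]
ELEMENTS = [("Invoice", (50, 40, 200, 60)), ("Total", (50, 700, 120, 720))]
REPS = semantic_representations(ELEMENTS, 612, 792, WEIGHTS)
```

In a trained system the weights and the embedding table would be learned from documents with known reading order; here they are fixed only so that the data flow is visible.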

What carries the argument

A two-part learned pipeline: (1) a feature-extraction network that takes a text embedding plus a layout embedding (bounding-box coordinates of each element), concatenates them, and outputs a semantic representation per text element; (2) an order-determination component that maps the set of representations to a reading-order sequence. Both are trained against ground-truth orderings of sample documents. The load-bearing idea is that injecting bounding-box geometry into the token representation, rather than relying on raster order or rule-based sorting, lets the network learn page-specific reading conventions.
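The summary does not say how the order-determination component is built, so the following is only one plausible shape for it, under stated assumptions: a linear scoring head (hypothetical `head_weights`) assigns each element a scalar, elements are read in ascending score, and a pairwise hinge loss against the ground-truth order supplies the training signal. A sequence decoder or pointer network would be equally consistent with the description.

```python
def order_scores(representations, head_weights):
    """Assumed scoring head: one scalar per element; lower score means read earlier."""
    return [sum(w * x for w, x in zip(head_weights, rep)) for rep in representations]

def predict_order(representations, head_weights):
    """Return element indices sorted into predicted reading order."""
    scores = order_scores(representations, head_weights)
    return sorted(range(len(scores)), key=lambda i: scores[i])

def pairwise_hinge_loss(representations, head_weights, gold_order, margin=0.1):
    """Sum of hinge penalties over every pair (a read before b) in the gold order.
    A pair is penalized unless score[b] exceeds score[a] by at least `margin`."""
    scores = order_scores(representations, head_weights)
    loss = 0.0
    for pos, a in enumerate(gold_order):
        for b in gold_order[pos + 1:]:
            loss += max(0.0, margin - (scores[b] - scores[a]))
    return loss

# Toy one-dimensional representations for three text elements.
REPS = [[0.9], [0.1], [0.5]]
HEAD = [1.0]
PREDICTED = predict_order(REPS, HEAD)                    # [1, 2, 0]
LOSS_AGREE = pairwise_hinge_loss(REPS, HEAD, [1, 2, 0])  # gold order matches the scores
LOSS_DISAGREE = pairwise_hinge_loss(REPS, HEAD, [0, 2, 1])
```

The loss is zero when the scores already respect the gold order with the required margin and grows with each inverted pair, which is the gradient signal that would flow back into both the head and the feature extractor during joint training.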

If this is right

Document understanding systems built on top of this (information extraction, summarization, translation of scanned PDFs) gain a principled front-end instead of brittle geometric sorting rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture is a natural fit for predicting other structural relations on a page (parent/child block hierarchy, key–value pairings in forms, column membership) by swapping the supervision target while keeping the text+layout fusion backbone.

Load-bearing premise

That combining text embeddings with bounding-box embeddings and feeding them into a neural network — to predict reading order — is a distinct enough invention to stand apart from the layout-aware document models already published before the priority date.

What would settle it

Run the claimed pipeline against a held-out benchmark of documents with annotated reading order (forms, multi-column articles, receipts) and compare against (a) a pure top-to-bottom, left-to-right heuristic and (b) a text-only sequence model with no layout embedding. If layout-augmented embeddings do not measurably improve ordering accuracy on layouts where the heuristic fails, the central premise that layout fusion is what carries the method does not hold.
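Baseline (a) and a scoring metric for that comparison can be sketched as follows. Both are assumptions for illustration: the heuristic buckets boxes into rows by their top edge using a hypothetical `line_tolerance`, and the metric is pairwise ordering accuracy, one of several reasonable choices that the source does not prescribe. The toy page is a two-column layout, the kind on which the heuristic is expected to fail.

```python
def heuristic_order(bboxes, line_tolerance=10):
    """Top-to-bottom, left-to-right: bucket boxes into rows by y0, then sort by x0."""
    return sorted(
        range(len(bboxes)),
        key=lambda i: (bboxes[i][1] // line_tolerance, bboxes[i][0]),
    )

def pairwise_accuracy(predicted, gold):
    """Fraction of element pairs whose relative order matches the gold order."""
    position = {element: rank for rank, element in enumerate(predicted)}
    correct = 0
    total = 0
    for rank, a in enumerate(gold):
        for b in gold[rank + 1:]:
            total += 1
            if position[a] < position[b]:
                correct += 1
    return correct / total if total else 1.0

# Two-column page: indices 0 and 1 form the left column, 2 and 3 the right column.
BBOXES = [
    (50, 100, 280, 180),   # left column, top
    (50, 200, 280, 280),   # left column, bottom
    (320, 100, 550, 180),  # right column, top
    (320, 200, 550, 280),  # right column, bottom
]
GOLD = [0, 1, 2, 3]                     # read the left column fully, then the right
BASELINE = heuristic_order(BBOXES)      # [0, 2, 1, 3]: the heuristic reads across columns
BASELINE_ACC = pairwise_accuracy(BASELINE, GOLD)  # 5 of 6 pairs correct
```

Scoring the learned model and the text-only ablation with the same `pairwise_accuracy` function on the same held-out pages would give the three-way comparison the paragraph above calls for.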

Figures

Figures reproduced from USPTO: patent/us-12619828 by Furu Wei (Beijing), Lei Cui (Beijing), Yang Xu (Harbin), Yiheng Xu (Beijing), Zilong Wang (Shanghai).

**Sheet 1.** Drawing sheet 1 from US 12619828.

**Sheet 2.** Drawing sheet 2 from US 12619828.

**Sheet 3.** Drawing sheet 3 from US 12619828.

**Sheet 4.** Drawing sheet 4 from US 12619828.

Original abstract

According to embodiments of the present disclosure, there is provided a solution for reading order detection in a document. In the solution, a computer-implemented method includes: determining a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations. According to the solution, the introduction of the layout information can better characterize a spatial layout manner of the text elements in a specific document, thereby determining the reading order more effectively and accurately.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a patent, the document does not derive results from axioms; it specifies a method and asserts claim scope. The "ledger" here records the structural commitments the claims rest on: (i) availability of layout/bbox information for each text element (presupposes upstream OCR or PDF parsing), (ii) a trainable encoder that benefits from concatenated text+layout embeddings, and (iii) availability of format-derived order labels for training. No new physical entities are postulated; no fitted parameters are central to the claim. The main "free parameter" is the size of the trained model, which is unspecified.

pith-pipeline@v0.9.0 · 12813 in / 5335 out tokens · 82135 ms · 2026-05-06T03:36:18.676689+00:00 · methodology
