{"total":18,"items":[{"citing_arxiv_id":"2605.31550","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-29T17:10:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STR rewrites table cells as <item path, feature path, value> triplets and uses TripletQL to match or exceed HTML baselines on four benchmarks while cutting tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27978","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ABot-OCR Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-27T05:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ABot-OCR is a new end-to-end VLM for direct image-to-Markdown transcription using a custom data engine and structure-constrained RL optimization, reporting SOTA scores of 92.81/93.30 on OmniDocBench v1.5/v1.6.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22100","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing","primary_cat":"cs.AI","submitted_at":"2026-05-21T07:36:41+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12623","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DocAtlas: Multilingual Document Understanding Across 80+ 
Languages","primary_cat":"cs.CL","submitted_at":"2026-05-12T18:09:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07492","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:30:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"DotsMOCR [25] 3B 76.27 0.151 66.23 77.65 0.273 73.16 0.198 64.32 74.95 0.309 61.73 0.312 54.39 61.97 0.393 70.39 Dolphin-v2 [6] 3B 65.90 0.342 59.80 72.12 0.429 60.24 0.393 52.20 67.86 0.461 44.92 0.553 39.98 50.04 0.558 57.02 MonkeyOCR-pro-3B [26] 3B 62.23 0.346 48.46 72.83 0.492 57.40 0.397 45.57 66.32 0.526 46.49 0.511 38.18 52.43 0.600 55.37 YouTu-Parsing [27] 2B 75.02 0.230 67.34 80.74 0.358 69.66 0.270 61.44 74.49 0.388 60.29 0.360 52.20 64.69 0.430 68.32 MinerU2.5-Pro [5] 1.2B 75.87 0.222 65.14 84.68 0.346 71.77 0.272 61.79 80.73 0.378 62.56 0.375 52.70 72.47 0.446 70.07 MinerU2.5 [4] 1.2B 74.90 0.184 62.08 81.04 0.327 68.92 0.245 56.99 74.24 0.374 59.15 0.370 49.01 65.41 0.446 67.66 MonkeyOCR-pro-1.2B [26] 
1."},{"citing_arxiv_id":"2604.12978","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts","primary_cat":"cs.CL","submitted_at":"2026-04-14T17:12:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Bench [35], OCRBench v2 [15], CC-OCR [58], and OmniDocBench [41], evaluate models on Latin, CJK, and a small number of other mid-resource scripts. Even recent work explicitly targeting multilingual OCR, such as [31] with its XDocParse benchmark (not publicly released) covering 126 languages, focuses on languages rather than scripts, and the underlying script diversity remains limited. Work on minority scripts [33] has made important progress, but covers only a handful of writing systems. No existing benchmark evaluates OCR across the full breadth of Unicode. This matters because the Unicode Standard (version 17.0 at the time of writing) currently encodes 172 scripts, representing thousands of years of human writing across every inhabited continent. 
Many of"},{"citing_arxiv_id":"2604.12812","ref_index":6,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding","primary_cat":"cs.AI","submitted_at":"2026-04-14T14:39:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-efficient resolution allocation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction. In arXiv:2409.18839, 2024. 1 [5] Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. In arXiv:2505.14059, 2025. [6] Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm. In arXiv:2506.05218, 2025. 1 [7] YL Liu, HL Li, X Bai, et al. 
A brief analysis of chatgpt: historical evolution current applications and future prospects"},{"citing_arxiv_id":"2604.06160","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Character Error Vector: Decomposable errors for page-level OCR evaluation","primary_cat":"cs.CV","submitted_at":"2026-04-07T17:56:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The Character Error Vector is a decomposable bag-of-characters evaluator for page-level OCR that remains defined under parsing errors and bridges parsing metrics with local CER.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv:2109.10282 [cs]. Sept. 2022. doi:10.48550/arXiv.2109.10282. url: http://arxiv.org/abs/2109.10282 (visited on 03/27/2026). [24] Yumeng Li et al. dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model. arXiv:2512.02498 [cs]. Dec. 2025. doi:10.48550/arXiv.2512.02498. url: http://arxiv.org/abs/2512.02498 (visited on 03/27/2026). [25] Zhang Li et al. MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm. arXiv:2506.05218 [cs]. Feb. 2026. doi:10.48550/arXiv.2506.05218. url: http://arxiv.org/abs/2506.05218 (visited on 04/04/2026). [26] Nikolaos Livathinos et al. Advanced Layout Analysis Models for Docling. arXiv:2509.11720 [cs]. Sept. 2025. doi:10.
48550/arXiv."},{"citing_arxiv_id":"2604.04771","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale","primary_cat":"cs.CV","submitted_at":"2026-04-06T15:44:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, native-resolution processing incurs O(N^2) token complexity, creating efficiency bottlenecks for high-resolution documents. Decoupled VLM methods. These methods separate layout analysis from content recognition, combining the controllability of pipeline approaches with the semantic modeling power of VLMs. Early works such as Dolphin [12] and MonkeyOCR [19] demonstrated the viability of this paradigm but faced limitations in resolution handling or system complexity. MinerU2.5 [25] unifies layout analysis and content recognition within a single 1.2B-parameter model with native-resolution support [26], balancing resolution fidelity, efficiency, and deployment complexity. 
Subsequent works extend the decoupled paradigm along various axes: multi-token prediction for throughput [11], diffusion-based"},{"citing_arxiv_id":"2604.02880","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InstructTable: Improving Table Structure Recognition Through Instructions","primary_cat":"cs.CV","submitted_at":"2026-04-03T08:44:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public benchmarks and a new complex-table test set.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"series [42, 52] proposes a unified framework to simultaneously handle multiple text parsing tasks including TSR. TabPedia [54] integrates various visual table understanding tasks through a concept synergy mechanism. GOT [44] extends the OCR to recognize various optical signals, including tables, text, and other elements, under a unified character concept. MonkeyOCR [17] employs a Structure-Recognition-Relation (SRR) triplet paradigm, decomposing document parsing into layout analysis, content identification, and logical ordering to balance accuracy and efficiency. 
More recent work [6, 26, 36] employs a two-stage scheme of layout detection-content recognition for document parsing, while MinerU 2.5 and PaddleOCR-VL"},{"citing_arxiv_id":"2604.02692","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing","primary_cat":"cs.CV","submitted_at":"2026-04-03T03:36:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DiT [10] demonstrates that document-domain pretraining can substantially benefit downstream layout analysis, while LayoutLMv3 [9] unifies text, image, and layout modeling within a generalized multimodal pretraining framework. Furthermore, mmLayout [28] emphasizes multi-grained document structure modeling beyond local token-level features. At the system level, MonkeyOCR [14] and PP-StructureV3 [5] integrate layout analysis with OCR and structured export within modular pipelines. Subsequent studies model layout prediction and reading order more jointly. DLAFormer [27] unifies multiple DLA subtasks within a Transformer-based framework. 
GraphLayoutLM [ 11] explores this trajectory by introducing graph-based layout modeling and"},{"citing_arxiv_id":"2603.24326","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing","primary_cat":"cs.CV","submitted_at":"2026-03-25T14:08:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.23885","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training","primary_cat":"cs.CV","submitted_at":"2026-03-25T03:19:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.09677","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Logics-Parsing-Omni Technical Report","primary_cat":"cs.AI","submitted_at":"2026-03-10T13:46:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict 
evidence alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.11731","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thinking with Drafting: Optical Decompression via Logical Reconstruction","primary_cat":"cs.CL","submitted_at":"2026-02-12T08:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21957","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing","primary_cat":"cs.CV","submitted_at":"2026-01-29T16:35:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PaddleOCR-VL-1.5 is a 0.9B VLM achieving 94.5% SOTA accuracy on OmniDocBench v1.5, with added robustness to physical distortions and support for seal recognition plus text spotting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18234","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-OCR: Contexts Optical Compression","primary_cat":"cs.CV","submitted_at":"2025-10-21T02:41:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision 
tokens.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"overall text formula table order overall text formula table order Pipline Models Dolphin [11] - 0.356 0.352 0.465 0.258 0.35 0.44 0.44 0.604 0.367 0.351 Marker [1] - 0.296 0.085 0.374 0.609 0.116 0.497 0.293 0.688 0.678 0.329 Mathpix [2] - 0.191 0.105 0.306 0.243 0.108 0.364 0.381 0.454 0.32 0.30 MinerU-2.1.1 [34] - 0.162 0.072 0.313 0.166 0.097 0.244 0.111 0.581 0.15 0.136 MonkeyOCR-1.2B [18] - 0.154 0.062 0.295 0.164 0.094 0.263 0.179 0.464 0.168 0.243 PPstructure-v3 [9] - 0.152 0.073 0.295 0.162 0.077 0.223 0.136 0.535 0.111 0.11 End-to-end Models Nougat [6] 2352 0.452 0.365 0.488 0.572 0.382 0.973 0.998 0.941 1.00 0.954 SmolDocling [25] 392 0.493 0.262 0.753 0.729 0.227 0.816 0.838 0.997 0.907 0.522 InternVL2-76B [8] 6790 0.44 0.353 0."},{"citing_arxiv_id":"2509.22186","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing","primary_cat":"cs.CV","submitted_at":"2025-09-26T10:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}