
arxiv: 2504.11101 · v4 · submitted 2025-04-15 · 💻 cs.CV · cs.AI · cs.MM

Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Pith reviewed 2026-05-22 20:25 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords consensus entropy, OCR, vision-language models, multi-model agreement, self-verification, unsupervised quality control, ensemble methods

The pith

Multi-VLM agreement entropy verifies OCR quality and enables self-improvement without training or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Consensus Entropy as a metric that judges OCR reliability by measuring how much the outputs from several vision-language models agree with one another. Correct transcriptions tend to produce matching strings across models while mistakes produce divergent ones, so the entropy of the agreement distribution acts as a direct signal of output quality. This signal drives CE-OCR, a lightweight framework that verifies results, selects the strongest candidate from the ensemble, and applies adaptive routing to allocate compute only where disagreement is high. The approach requires no training data or model fine-tuning and improves verification F1 by 42.1 percent over single-model judging while lifting OCR accuracy above both self-consistency and single-VLM baselines at fixed cost.

Core claim

Consensus Entropy estimates OCR reliability by computing the entropy of the output distribution across an ensemble of VLMs; because correct predictions converge in output space while errors diverge, the resulting scalar serves as an unsupervised quality score that enables verification, best-output selection, and adaptive routing in the CE-OCR framework.

What carries the argument

Consensus Entropy: the entropy of the empirical distribution of distinct OCR strings returned by multiple VLMs on the same image.
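
As a rough illustration of this reading, the Python sketch below computes the Shannon entropy of the empirical distribution of distinct strings. It assumes two outputs agree only when they match exactly after whitespace normalization, which is a simplification: the Figure 1 caption indicates that the paper derives its distribution from pairwise similarities. The name `consensus_entropy` and the normalization by the log of the ensemble size are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def consensus_entropy(outputs, normalize=True):
    """Shannon entropy of the empirical distribution of distinct strings.

    Assumption: outputs "agree" only on an exact match after whitespace
    normalization. The paper's agreement function may differ.
    """
    if not outputs:
        raise ValueError("need at least one output")
    cleaned = [" ".join(o.split()) for o in outputs]
    counts = Counter(cleaned)
    n = len(cleaned)
    # p * log(1/p) summed over the distinct strings
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    if normalize and n > 1:
        entropy /= math.log(n)  # scale to [0, 1]; the maximum is all outputs distinct
    return entropy
```

Under these assumptions, three identical transcriptions score 0 and three mutually distinct transcriptions score 1. Exact matching treats a one-character slip the same as a completely different string, which is one reason a similarity-based distribution may be preferable.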

If this is right

  • Quality verification reaches 42.1 percent higher F1 than VLM-as-Judge.
  • CE-OCR raises OCR accuracy over self-consistency and single-model baselines at identical cost.
  • The method integrates plug-and-play with any set of VLMs and needs no supervision.
  • Adaptive routing based on entropy reduces compute on high-agreement samples.
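
The verification, selection, and routing steps listed above can be sketched as a single gate. This is a minimal Python sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the paper's similarity measure, the disagreement score is one minus the mean pairwise similarity rather than the paper's entropy, and `theta`, `route`, and `fallback` are hypothetical names. What the paper does with high-disagreement inputs is not spelled out on this page, so `fallback` is only a placeholder callback.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def select_best(outputs):
    """Pick the output with the highest mean similarity to the others."""
    if len(outputs) == 1:
        return outputs[0]

    def mean_sim(i):
        others = [similarity(outputs[i], o) for j, o in enumerate(outputs) if j != i]
        return sum(others) / len(others)

    return outputs[max(range(len(outputs)), key=mean_sim)]


def disagreement(outputs):
    """One minus mean pairwise similarity (a proxy, not the paper's entropy)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(similarity(a, b) for a, b in pairs) / len(pairs)


def route(outputs, theta=0.2, fallback=None):
    """Accept the consensus pick when disagreement <= theta, else escalate.

    Returns (text, score, accepted). `fallback` is a hypothetical callback
    for extra compute, such as a stronger model or a rephrasing pass.
    """
    score = disagreement(outputs)
    accepted = score <= theta
    if accepted or fallback is None:
        return select_best(outputs), score, accepted
    return fallback(), score, False
```

Only samples whose score exceeds `theta` trigger the extra call, which is where the compute saving on high-agreement samples would come from.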

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agreement-entropy signal could be tested on other multimodal generation tasks such as captioning or chart parsing.
  • Large-scale curation pipelines for LLM training data could filter OCR-derived text automatically using only model disagreement.
  • If the convergence property persists when newer VLMs are added to the ensemble, accuracy gains would compound without changing the metric.

Load-bearing premise

Correct OCR predictions from different VLMs converge in output space while errors cause divergence, supplying a reliable quality signal.

What would settle it

The premise would be falsified if, on a held-out OCR dataset with ground-truth labels, high-entropy (low-agreement) samples showed lower error rates than low-entropy samples, or if the correlation between entropy and error rate were near zero.
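
A check of this kind could be scripted roughly as follows. This is a sketch, assuming per-sample entropy scores and per-sample error values (for example 0/1 correctness or a character error rate) have already been computed; the function names and the split threshold `theta` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def error_rate_by_entropy(entropies, errors, theta):
    """Mean error for samples at or below theta and for samples above it."""
    low = [e for h, e in zip(entropies, errors) if h <= theta]
    high = [e for h, e in zip(entropies, errors) if h > theta]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(low), mean(high)
```

A correlation near zero, or a low-entropy group with the higher mean error, would count against the premise; a clearly positive correlation would be consistent with it.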

Figures

Figures reproduced from arXiv: 2504.11101 by Chenhui Li, Erfei Cui, Gongshen Liu, Guoqing Wang, Tianyi Liang, Xinyue Huang, Xu Guo, Yulong Zhang.

Figure 1: Overview of the CE-OCR Framework. Given an input image, multiple Vision-Language Models (VLMs) independently generate OCR predictions. Pairwise similarities among these results yield a probability distribution over consensus quality, from which the Consensus Entropy δ is derived. Based on δ, a threshold gate θ determines the next step: low-entropy ensemble predictions are accepted, while inputs with entrop…

Figure 2: Prediction behaviors across entropy levels. Each plot visualizes VLM predictions in a 2D space. In low-entropy cases (a), predictions cluster tightly around the ground truth (green), while in medium (b) and high-entropy (c) settings, predictions increasingly diverge (details in Appendix B).
  …character-level precision (e.g., standard OCR, mathematical expressions), we use Edit Distance; for tasks where sema…

Figure 3: Normalized entropy [0,1] analysis of four combination

Figure 4: Model Performance on OCRBench under Different CE Thresholds. CE separately computed with two reference models (ref1: Qwen2VL-7B; ref2: Qwen2VL-72B). Maximum Accuracy is 1.0. 210models avg: average performance across 210 models. Shaded area covers ref1 and ref2 accuracy; solid line is their mean.

Figure 5: Performance comparison across token lengths on OCRBench. SC@3: Self-Consistency with 3 samples; Routing: ensemble with rephrasing; Single: Single models.
Original abstract

Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration. Code: https://github.com/Aslan-yulong/consensus-entropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes Consensus Entropy (CE), a training-free, model-agnostic metric for estimating OCR output reliability based on the entropy of agreement among multiple VLMs. The central hypothesis is that correct predictions converge in output space while errors diverge. Building on CE, the authors introduce CE-OCR for verification, selection, and adaptive routing, reporting a 42.1% F1 improvement over VLM-as-Judge and consistent gains over baselines at equivalent cost, with open code provided.

Significance. If the convergence assumption holds across models and datasets, this represents a meaningful contribution to unsupervised quality control in OCR for VLMs, with potential impact on data generation for LLM training. The work is strengthened by its training-free nature, reproducibility via the linked GitHub repository, and empirical gains demonstrated against explicit baselines.

major comments (2)
  1. [§3.2] §3.2 (Consensus Entropy definition): The paper does not provide a formal characterization of the output-space distance or agreement function used to compute entropy over variable-length OCR strings (e.g., exact lexical match, normalized edit distance, or embedding similarity). This is load-bearing for the central claim because the entropy signal's ability to track correctness versus spurious consensus depends directly on this choice.
  2. [§5.3] §5.3 (Robustness experiments): While aggregate F1 and accuracy gains are reported, there is no dedicated ablation or failure-case analysis on inputs where correlated model biases are likely (low-contrast text, rare glyphs, layout ambiguities). This leaves the weakest assumption untested at the level required to support the 42.1% gain claim as a general unsupervised signal.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'inter-model agreement entropy' is introduced without a one-sentence indication of the underlying string comparison method.
  2. [Table 2] Table 2: The cost column should explicitly state the number of VLM calls per sample for each baseline to make the 'same cost' comparison transparent.
  3. [§4.1] §4.1: The adaptive routing threshold is described in prose; an equation would improve clarity and reproducibility.
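
One possible form for the equation requested in the last minor comment is sketched below. It reuses the symbols δ (Consensus Entropy) and θ (threshold) from the Figure 1 caption. The names y_ens and y_route, and the choice to accept at equality, are assumptions made for illustration; the caption is truncated before it describes what happens to high-entropy inputs.

```latex
\[
\hat{y}(x) =
\begin{cases}
  y_{\mathrm{ens}}(x),   & \delta(x) \le \theta, \\
  y_{\mathrm{route}}(x), & \delta(x) > \theta,
\end{cases}
\]
```

Here y_ens(x) denotes the prediction selected from the ensemble and y_route(x) denotes the output of whatever additional step the framework applies when disagreement is high.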

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Consensus Entropy definition): The paper does not provide a formal characterization of the output-space distance or agreement function used to compute entropy over variable-length OCR strings (e.g., exact lexical match, normalized edit distance, or embedding similarity). This is load-bearing for the central claim because the entropy signal's ability to track correctness versus spurious consensus depends directly on this choice.

    Authors: We agree that a formal characterization strengthens the central claim. The manuscript computes agreement via normalized Levenshtein distance (with a fixed threshold for binary agreement) to accommodate variable-length strings; we will revise §3.2 to include the precise mathematical definition of the agreement function, the resulting distribution over which entropy is taken, and a brief justification for the choice over alternatives such as embedding cosine similarity. revision: yes

  2. Referee: [§5.3] §5.3 (Robustness experiments): While aggregate F1 and accuracy gains are reported, there is no dedicated ablation or failure-case analysis on inputs where correlated model biases are likely (low-contrast text, rare glyphs, layout ambiguities). This leaves the weakest assumption untested at the level required to support the 42.1% gain claim as a general unsupervised signal.

    Authors: We acknowledge that targeted failure-case analysis would provide stronger support for the convergence assumption. While the reported experiments already span datasets containing low-contrast and ambiguous samples, we will add a dedicated subsection to §5.3 with quantitative ablations on low-contrast text, rare glyphs, and layout ambiguities, reporting CE scores and downstream F1 under these conditions. revision: yes
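
The agreement function described in the first response can be sketched as follows. Note that the rebuttal is simulated, so normalized Levenshtein distance with a binary threshold is this page's description rather than a confirmed detail of the paper. The threshold `tau`, the greedy clustering step, and all function names are assumptions made for illustration.

```python
import math


def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def agreement_clusters(outputs, tau=0.1):
    """Greedy grouping: join the first cluster whose first member is within tau."""
    clusters = []
    for out in outputs:
        for cluster in clusters:
            if normalized_distance(out, cluster[0]) <= tau:
                cluster.append(out)
                break
        else:
            clusters.append([out])
    return clusters


def cluster_entropy(outputs, tau=0.1):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(outputs)
    sizes = [len(c) for c in agreement_clusters(outputs, tau)]
    return sum((s / n) * math.log(n / s) for s in sizes)
```

Greedy grouping depends on input order and on which string happens to represent a cluster, so a real implementation might prefer a symmetric scheme. The sketch only shows how a distance threshold turns variable-length strings into a distribution over which entropy can be taken.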

Circularity Check

0 steps flagged

No circularity: CE is defined directly from observed inter-model disagreement as an external agreement metric.

Full rationale

The paper introduces Consensus Entropy as a training-free metric computed from inter-VLM output agreement entropy, with no parameters fitted to target correctness labels, no self-citations invoked as load-bearing uniqueness theorems, and no reduction of predictions to fitted inputs by construction. The convergence assumption is presented as an empirical hypothesis tested on benchmarks rather than smuggled in via definition or prior self-work. The derivation chain remains self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on one domain assumption about output convergence and introduces no free parameters or new entities in the abstract description.

axioms (1)
  • domain assumption: Correct predictions from different VLMs converge in output space while errors diverge.
    Stated as the core insight that justifies using agreement entropy as a reliability signal.

pith-pipeline@v0.9.0 · 5745 in / 1193 out tokens · 64345 ms · 2026-05-22T20:25:49.040612+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

    cs.CV · 2025-12 · unverdicted · novelty 6.0

    High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.
