Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Chenhui Li; Erfei Cui; Gongshen Liu; Guoqing Wang; Tianyi Liang; Xinyue Huang; Xu Guo; Yulong Zhang

arXiv: 2504.11101 · v4 · submitted 2025-04-15 · cs.CV · cs.AI · cs.MM

Yulong Zhang, Tianyi Liang, Xinyue Huang, Erfei Cui, Guoqing Wang, Xu Guo, Chenhui Li, Gongshen Liu

Pith reviewed 2026-05-22 20:25 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.MM

keywords consensus entropy · OCR · vision-language models · multi-model agreement · self-verification · unsupervised quality control · ensemble methods

The pith

Multi-VLM agreement entropy verifies OCR quality and enables self-improvement without training or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Consensus Entropy as a metric that judges OCR reliability by measuring how much the outputs from several vision-language models agree with one another. Correct transcriptions tend to produce matching strings across models while mistakes produce divergent ones, so the entropy of the agreement distribution acts as a direct signal of output quality. This signal drives CE-OCR, a lightweight framework that verifies results, selects the strongest candidate from the ensemble, and applies adaptive routing to allocate compute only where disagreement is high. The approach requires no training data or model fine-tuning and improves verification F1 by 42.1 percent over single-model judging while lifting OCR accuracy above both self-consistency and single-VLM baselines at fixed cost.

Core claim

Consensus Entropy estimates OCR reliability by computing the entropy of the output distribution across an ensemble of VLMs; because correct predictions converge in output space while errors diverge, the resulting scalar serves as an unsupervised quality score that enables verification, best-output selection, and adaptive routing in the CE-OCR framework.

What carries the argument

Consensus Entropy: the entropy of the empirical distribution of distinct OCR strings returned by multiple VLMs on the same image.
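
A minimal sketch of this definition, assuming that two outputs count as the same string when they match exactly after whitespace is collapsed. The paper's own agreement function may differ (for example, it may use edit distance), and the name `consensus_entropy` is illustrative rather than taken from the authors' code.

```python
import math
from collections import Counter


def consensus_entropy(outputs: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of distinct strings.

    Assumption: outputs are treated as identical if they match exactly after
    collapsing whitespace. This is an illustrative choice, not the paper's.
    """
    if not outputs:
        raise ValueError("need at least one model output")
    normalized = [" ".join(text.split()) for text in outputs]
    counts = Counter(normalized)
    total = len(normalized)
    # sum of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

Under this sketch, full agreement gives an entropy of zero, and the value grows as the ensemble splits into more distinct strings.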

If this is right

Quality verification reaches 42.1 percent higher F1 than VLM-as-Judge.
CE-OCR raises OCR accuracy over self-consistency and single-model baselines at identical cost.
The method integrates plug-and-play with any set of VLMs and needs no supervision.
Adaptive routing based on entropy reduces compute on high-agreement samples.
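
The routing idea in the last item can be sketched as a threshold gate. This is a hypothetical illustration: the names `route`, `theta`, and `fallback` are assumptions, the entropy uses exact string matching, and a majority vote stands in for whatever selection rule the paper actually applies.

```python
import math
from collections import Counter
from typing import Callable


def route(outputs: list[str], theta: float, fallback: Callable[[], str]) -> tuple[str, bool]:
    """Return (text, escalated).

    Accept the most common output when entropy <= theta; otherwise call
    `fallback` (hypothetical: e.g. a stronger model or a re-prompt).
    """
    if not outputs:
        raise ValueError("need at least one model output")
    counts = Counter(outputs)
    total = len(outputs)
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    if entropy <= theta:
        best, _ = counts.most_common(1)[0]
        return best, False  # high agreement: no extra compute spent
    return fallback(), True  # low agreement: spend compute here
```

The fallback is only invoked on the high-entropy branch, which is where the compute saving on high-agreement samples would come from.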

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agreement-entropy signal could be tested on other multimodal generation tasks such as captioning or chart parsing.
Large-scale curation pipelines for LLM training data could filter OCR-derived text automatically using only model disagreement.
If the convergence property persists when newer VLMs are added to the ensemble, accuracy gains would compound without changing the metric.

Load-bearing premise

Correct OCR predictions from different VLMs converge in output space while errors cause divergence, supplying a reliable quality signal.

What would settle it

The premise would be refuted if, on a held-out OCR dataset with ground-truth labels, high-entropy (low-agreement) samples showed lower error rates than low-entropy samples, or if the correlation between entropy and error rate were near zero.
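
One way to run such a check, sketched under assumptions: given per-sample entropy scores and binary error labels from a labeled set, compare the error rate in the low-entropy half with that in the high-entropy half. The median split and the function name are illustrative choices, not the paper's protocol.

```python
from statistics import median


def error_rate_by_entropy(entropies: list[float], is_error: list[bool]) -> tuple[float, float]:
    """Return (error rate of low-entropy samples, error rate of high-entropy samples).

    Samples are split at the median entropy (<= median counts as low).
    The premise predicts low < high; the reverse, or no gap, counts against it.
    """
    if len(entropies) != len(is_error) or not entropies:
        raise ValueError("inputs must be non-empty and of equal length")
    cut = median(entropies)
    low = [err for h, err in zip(entropies, is_error) if h <= cut]
    high = [err for h, err in zip(entropies, is_error) if h > cut]

    def rate(flags: list[bool]) -> float:
        # NaN marks an empty bucket (e.g. all entropies equal)
        return sum(flags) / len(flags) if flags else float("nan")

    return rate(low), rate(high)
```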

Figures

Figures reproduced from arXiv: 2504.11101 by Chenhui Li, Erfei Cui, Gongshen Liu, Guoqing Wang, Tianyi Liang, Xinyue Huang, Xu Guo, Yulong Zhang.

**Figure 1.** Overview of the CE-OCR Framework. Given an input image, multiple Vision-Language Models (VLMs) independently generate OCR predictions. Pairwise similarities among these results yield a probability distribution over consensus quality, from which the Consensus Entropy δ is derived. Based on δ, a threshold gate θ determines the next step: low-entropy ensemble predictions are accepted, while inputs with entrop…

**Figure 2.** Prediction behaviors across entropy levels. Each plot visualizes VLM predictions in a 2D space. In low-entropy cases (a), predictions tightly cluster around the ground truth (green), while in medium (b) and high-entropy (c) settings, predictions increasingly diverge (Details in Appendix B).

**Figure 3.** Normalized entropy [0,1] analysis of four combination

**Figure 4.** Model Performance on OCRBench under Different CE Thresholds. CE separately computed with two reference models (ref1: Qwen2VL-7B; ref2: Qwen2VL-72B). Maximum Accuracy is 1.0. 210models avg: average performance across 210 models. Shaded area covers ref1 and ref2 accuracy; solid line is their mean.

**Figure 5.** Performance comparison across token lengths on OCRBench. SC@3: Self-Consistency with 3 samples; Routing: ensemble with rephrasing; Single: Single models.

Abstract

Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration. Code: https://github.com/Aslan-yulong/consensus-entropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes Consensus Entropy (CE), a training-free, model-agnostic metric for estimating OCR output reliability based on the entropy of agreement among multiple VLMs. The central hypothesis is that correct predictions converge in output space while errors diverge. Building on CE, the authors introduce CE-OCR for verification, selection, and adaptive routing, reporting a 42.1% F1 improvement over VLM-as-Judge and consistent gains over baselines at equivalent cost, with open code provided.

Significance. If the convergence assumption holds across models and datasets, this represents a meaningful contribution to unsupervised quality control in OCR for VLMs, with potential impact on data generation for LLM training. The work is strengthened by its parameter-free nature, reproducibility via the linked GitHub repository, and empirical gains demonstrated against explicit baselines.

major comments (2)

§3.2 (Consensus Entropy definition): The paper does not provide a formal characterization of the output-space distance or agreement function used to compute entropy over variable-length OCR strings (e.g., exact lexical match, normalized edit distance, or embedding similarity). This is load-bearing for the central claim because the entropy signal's ability to track correctness versus spurious consensus depends directly on this choice.
§5.3 (Robustness experiments): While aggregate F1 and accuracy gains are reported, there is no dedicated ablation or failure-case analysis on inputs where correlated model biases are likely (low-contrast text, rare glyphs, layout ambiguities). This leaves the weakest assumption untested at the level required to support the 42.1% gain claim as a general unsupervised signal.

minor comments (3)

Abstract: The phrase 'inter-model agreement entropy' is introduced without a one-sentence indication of the underlying string comparison method.
Table 2: The cost column should explicitly state the number of VLM calls per sample for each baseline to make the 'same cost' comparison transparent.
§4.1: The adaptive routing threshold is described in prose; an equation would improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

Referee: §3.2 (Consensus Entropy definition): The paper does not provide a formal characterization of the output-space distance or agreement function used to compute entropy over variable-length OCR strings (e.g., exact lexical match, normalized edit distance, or embedding similarity). This is load-bearing for the central claim because the entropy signal's ability to track correctness versus spurious consensus depends directly on this choice.

Authors: We agree that a formal characterization strengthens the central claim. The manuscript computes agreement via normalized Levenshtein distance (with a fixed threshold for binary agreement) to accommodate variable-length strings; we will revise §3.2 to include the precise mathematical definition of the agreement function, the resulting distribution over which entropy is taken, and a brief justification for the choice over alternatives such as embedding cosine similarity.
Revision: yes

Referee: §5.3 (Robustness experiments): While aggregate F1 and accuracy gains are reported, there is no dedicated ablation or failure-case analysis on inputs where correlated model biases are likely (low-contrast text, rare glyphs, layout ambiguities). This leaves the weakest assumption untested at the level required to support the 42.1% gain claim as a general unsupervised signal.

Authors: We acknowledge that targeted failure-case analysis would provide stronger support for the convergence assumption. While the reported experiments already span datasets containing low-contrast and ambiguous samples, we will add a dedicated subsection to §5.3 with quantitative ablations on low-contrast text, rare glyphs, and layout ambiguities, reporting CE scores and downstream F1 under these conditions.
Revision: yes
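
The agreement function named in the first response (normalized Levenshtein distance with a fixed threshold) can be sketched as follows. This is an illustration only: the rebuttal is simulated, the helper names are hypothetical, and the threshold `tau = 0.1` is a placeholder rather than a value reported anywhere in the source.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def agree(a: str, b: str, tau: float = 0.1) -> bool:
    """Binary agreement: normalized edit distance at or below tau.

    tau = 0.1 is a placeholder chosen for illustration only.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings agree trivially
    return levenshtein(a, b) / longest <= tau
```

Normalizing by the longer string keeps the score in [0, 1], so one threshold can be applied to short and long transcriptions alike.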

Circularity Check

0 steps flagged

No circularity: CE is defined directly from observed inter-model disagreement as an external agreement metric.

The paper introduces Consensus Entropy as a training-free metric computed from inter-VLM output agreement entropy, with no parameters fitted to target correctness labels, no self-citations invoked as load-bearing uniqueness theorems, and no reduction of predictions to fitted inputs by construction. The convergence assumption is presented as an empirical hypothesis tested on benchmarks rather than smuggled in via definition or prior self-work. The derivation chain remains self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on one domain assumption about output convergence and introduces no free parameters or new entities in the abstract description.

axioms (1)

Domain assumption: Correct predictions from different VLMs converge in output space while errors diverge.
Stated as the core insight that justifies using agreement entropy as a reliability signal.

pith-pipeline@v0.9.0 · 5745 in / 1193 out tokens · 64345 ms · 2026-05-22T20:25:49.040612+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel; Jcost_pos_of_ne_one · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "correct predictions converge in output space, while errors diverge... Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes

Passage: "Low CE indicates high agreement and reliability; high CE signals ambiguity and potential error"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
cs.CV · 2025-12 · unverdicted · novelty 6.0

High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Llm-safety evaluations lack robustness, 2025

Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, and Stephan G ¨unnemann. Llm-safety evaluations lack robustness, 2025. 3

work page 2025
[3]

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents.arXiv:2308.13418, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

A survey on evaluation of large lan- guage models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large lan- guage models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024. 2

work page 2024
[5]

Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation, 2024

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation, 2024. 5, 8

work page 2024
[6]

Enhancing uncertainty modeling with se- mantic graph for hallucination detection.arXiv preprint arXiv:2501.02020, 2025

Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, Feng Zheng, and Liang He. Enhancing uncertainty modeling with se- mantic graph for hallucination detection.arXiv preprint arXiv:2501.02020, 2025. 3

work page arXiv 2025
[7]

Beyond factuality: A comprehensive evaluation of large lan- guage models as knowledge generators

Liang Chen, Yang Deng, Yatao Bian, Zeyu Qin, and et al. Beyond factuality: A comprehensive evaluation of large lan- guage models as knowledge generators. InEMNLP 2023, pages 6325–6341. Association for Computational Linguis- tics, 2023. 3

work page 2023
[8]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

work page
[10]

Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, and Philip S. Yu. Harnessing multiple large language mod- els: A survey on llm ensemble, 2025. 3, 9

work page 2025
[11]

Collective reasoning among llms a framework for answer validation without ground truth, 2025

Seyed Pouyan Mousavi Davoudi, Alireza Shafiee Fard, and Alireza Amiri-Margavi. Collective reasoning among llms a framework for answer validation without ground truth, 2025. 9

work page 2025
[12]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024. 5

work page 2024
[13]

Post-processing techniques applied to speech recognition output

Jonathan G Fiscus. Post-processing techniques applied to speech recognition output. InDARPA Speech Recognition Workshop, pages 59–62, 1997. Also known as ROVER (Rec- ognizer Output V oting Error Reduction). 5, 8

work page 1997
[14]

Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning,

Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxin Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. Ocrbench v2: An improved benchmark for evaluating large multimodal models on vis...

work page
[15]

Dropout as a bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. Ininternational conference on machine learning, pages 1050–1059. PMLR, 2016. 3

work page 2016
[16]

Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% perfor- mance.Visual Intelligence, 2(1):1–17, 2024

Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% perfor- mance.Visual Intelligence, 2(1):1–17, 2024. 2

work page 2024
[17]

Chatglm: A family of large language mod- els from glm-130b to glm-4 all tools, 2024

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ...

work page 2024
[18]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xue- hao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Smoothie: Label free language model routing

Neel Guha, Mayee F Chen, Trevor Chow, Ishan S Khare, and Christopher Re. Smoothie: Label free language model routing. InNeuIPS, 2024. 3

work page 2024
[21]

Opendatalab: Empowering general ar- 9 tificial intelligence with open datasets.arXiv preprint arXiv:2407.13773, 2024

Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, and Dahua Lin. Opendatalab: Empowering general ar- 9 tificial intelligence with open datasets.arXiv preprint arXiv:2407.13773, 2024. 2

work page arXiv 2024
[22]

Layoutlmv3: Pre-training for document ai with unified text and image masking

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. InProceedings of the 30th ACM international conference on multimedia, pages 4083–4091,

work page
[23]

Look before you leap: An exploratory study of uncertainty measurement for large language models.CoRR, abs/2307.10236, 2023

Yuheng Huang, Jiayang Song, Zhijie Wang, Huaming Chen, and Lei Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models.CoRR, abs/2307.10236, 2023. 3

work page arXiv 2023
[24]

Llm- blender: Ensembling large language models with pairwise ranking and generative fusion

Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm- blender: Ensembling large language models with pairwise ranking and generative fusion. InACL, 2023. 3

work page 2023
[25]

What uncertainties do we need in bayesian deep learning for computer vision?Advances in neural information processing systems, 30, 2017

Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision?Advances in neural information processing systems, 30, 2017. 3

work page 2017
[26]

Prometheus: Inducing fine- grained evaluation capability in language models

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sung- dong Kim, James Thorne, et al. Prometheus: Inducing fine- grained evaluation capability in language models. InThe Twelfth International Conference on Learning Representa- tions, 2023. 2

work page 2023
[27]

Preference leakage: A contamination problem in llm- as-a-judge, 2025

Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, and Huan Liu. Preference leakage: A contamination problem in llm- as-a-judge, 2025. 3

work page 2025
[28]

More agents is all you need

Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and De- heng Ye. More agents is all you need.arXiv preprint arXiv:2402.05120, 2024. 3

work page arXiv 2024
[29]

Evaluating object hallucination in large vision- language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision- language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023. 3

work page 2023
[30]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, 2004. Association for Computa- tional Linguistics. 3

work page 2004
[31]

Focus anywhere for fine- grained multi-page document understanding

Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Focus anywhere for fine-grained multi-page document understanding.arXiv:2405.14295,

work page arXiv
[32]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 2

work page 2023
[33]

Uncertainty quantification and confidence calibration in large language models: A survey, 2025

Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey, 2025. 3

work page 2025
[34]

Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12), 2024

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12), 2024. 2, 5

work page 2024
[35]

Calibrating llm-based evaluator

Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. Calibrating llm-based evaluator. InProceed- ings of the 2024 Joint International Conference on Com- putational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2638–2656, 2024. 3

work page 2024
[36]

Grobid.https : / / github

Patrice Lopez. Grobid.https : / / github . com / kermitt2/grobid, 2008–2025. 2

work page 2008
[37]

Urg: A unified ranking and generation method for ensembling language models

Bo Lv, Chen Tang, Yanan Zhang, Xin Liu, Ping Luo, and Yue Yu. Urg: A unified ranking and generation method for ensembling language models. InFindings of the ACL, 2024. 3

work page 2024
[38]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Self- checkgpt: Zero-resource black-box hallucination detection for generative large language models. InEMNLP 2023, pages 9004–9017. Association for Computational Linguis- tics, 2023. 3

work page 2023
[39]

His- torical review of ocr research and development.Proceedings of the IEEE, 80(7):1029–1058, 1992

Shunji Mori, Ching Y Suen, and Kazuhiko Yamamoto. His- torical review of ocr research and development.Proceedings of the IEEE, 80(7):1029–1058, 1992. 2

work page 1992
[40]

Gpt-4o technical overview.https://openai

OpenAI. Gpt-4o technical overview.https://openai. com/index/gpt- 4o- system- card/, 2024. Ac- cessed: 2025-04-06. 2, 5

work page 2024
[41]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

work page
[42]

Un- certainty quantification via stable distribution propagation,

Felix Petersen, Aashwin Mishra, Hilde Kuehne, Christian Borgelt, Oliver Deussen, and Mikhail Yurochkin. Un- certainty quantification via stable distribution propagation,

work page
[43]

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models, 2025

Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models, 2025. 2

work page 2025
[44]

Adversarial ml problems are getting harder to solve and to evaluate, 2025

Javier Rando, Jie Zhang, Nicholas Carlini, and Florian Tram`er. Adversarial ml problems are getting harder to solve and to evaluate, 2025. 3

work page 2025
[45]

Vila: Improving struc- tured content extraction from scientific pdfs using visual lay- out groups.Transactions of the Association for Computa- tional Linguistics, 10:376–392, 2022

Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S Weld, and Doug Downey. Vila: Improving struc- tured content extraction from scientific pdfs using visual lay- out groups.Transactions of the Association for Computa- tional Linguistics, 10:376–392, 2022. 2

work page 2022
[46]

Getting more out of mixture of lan- guage model reasoning experts

Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, and Jordan Boyd-Graber. Getting more out of mixture of lan- guage model reasoning experts. InFindings of EMNLP,

work page
[47]

Routledge, 2018

Bernard W Silverman.Density estimation for statistics and data analysis. Routledge, 2018. 3

work page 2018
[48]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adri `a Garriga- Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.arXiv preprint arXiv:2206.04615, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[49]

Llm-topla: Efficient llm ensemble by maximising diversity

Selim Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, and Ling Liu. Llm-topla: Efficient llm ensemble by maximising diversity. InFindings of EMNLP, 2024. 3 10

work page 2024
[51]

Mineru: An open-source solution for precise document content extrac- tion, 2024

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, and Conghui He. Mineru: An open-source solution for precise document content extrac- tion, 2024. 2

work page 2024
[52]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reason- ing in language models.arXiv preprint arXiv:2203.11171,

work page internal anchor Pith review Pith/arXiv arXiv
[54]

Latent space chain-of-embedding en- ables output-free llm self-evaluation.arXiv preprint arXiv:2410.13640, 2024

Yiming Wang, Pei Zhang, Baosong Yang, Derek F Wong, and Rui Wang. Latent space chain-of-embedding en- ables output-free llm self-evaluation.arXiv preprint arXiv:2410.13640, 2024. 3, 9

work page arXiv 2024
[55]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model.arXiv:2409.01704, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

Can LLMs express their un- certainty? an empirical evaluation of confidence elicitation in LLMs

Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their un- certainty? an empirical evaluation of confidence elicitation in LLMs. InThe Twelfth International Conference on Learn- ing Representations, 2024. 3

work page 2024
[57]

Pub- laynet: largest dataset ever for document layout analysis

Zhong Xu, Jianbin Tang, and Antonio Jimeno Yepes. Pub- laynet: largest dataset ever for document layout analysis. In2019 International conference on document analysis and recognition, pages 1015–1022, 2019. 2

[58] Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, and Junyang Lin. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy, 2024. 2, 5

[59] Ruiyang Zhang, Hu Zhang, and Zhedong Zheng. Vl-uncertainty: Detecting hallucination in large vision-language model via uncertainty estimation, 2024. 3

[60] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023. 2

[61] Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint, 2023. 2

