The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
Pith reviewed 2026-05-07 16:27 UTC · model grok-4.3
The pith
Large language models follow JSON schemas almost perfectly yet extract correct leaf values only 83 percent of the time from text, 67 percent from images, and 24 percent from audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0 percent on text, 67.2 percent on images, and 23.7 percent on audio, with longer contexts making extraction substantially harder.
What carries the argument
The Structured Output Benchmark (SOB), a collection of 5,000 text, 209 image, and 115 audio records, each pairing a natural-language question, a JSON schema, and a verified ground-truth answer, with all modalities supplied as text-normalized context to isolate structured-output performance.
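For concreteness, a single record plausibly looks like the following sketch; the key names here are illustrative guesses, not the field names used in the released dataset:

```python
# Hypothetical SOB-style record; key names are illustrative, not from the release.
record = {
    "source_modality": "image",  # one of: text, image, audio
    "context": "<OCR- or transcript-normalized text of the source>",
    "question": "What is the invoice total and its currency?",
    "schema": {  # JSON Schema the model's output must satisfy
        "type": "object",
        "properties": {
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["total", "currency"],
    },
    "answer": {"total": 1243.50, "currency": "EUR"},  # verified against the source
}
```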
If this is right
- Schema compliance alone is not a reliable proxy for end-to-end accuracy in structured data extraction tasks (see the sketch after this list).
- Value extraction performance degrades with longer contexts, so context-handling improvements would directly raise accuracy on complex documents.
- Real-world uses such as invoice parsing or medical record structuring require additional verification steps beyond current model outputs.
- Benchmarks limited to single modalities or to schema checks alone will understate the gap between model behavior and application requirements.
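The sketch below makes the first point concrete, using the jsonschema package: the model output validates against the schema, so it counts as fully compliant, yet a single transposed digit makes it wrong under exact leaf-value match. The schema and values are invented for illustration.

```python
from jsonschema import validate  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {"total": {"type": "number"}, "currency": {"type": "string"}},
    "required": ["total", "currency"],
}

ground_truth = {"total": 1243.50, "currency": "EUR"}
model_output = {"total": 1234.50, "currency": "EUR"}  # digits transposed

validate(instance=model_output, schema=schema)  # passes: fully schema-compliant
print(model_output == ground_truth)             # False: zero on value accuracy
```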
Where Pith is reading between the lines
- Further gains may come from better normalization pipelines that preserve layout or speaker cues before the model processes the input.
- Hybrid pipelines that combine language models with traditional parsing tools could close the remaining value-accuracy gap for production use.
- Extending the benchmark to raw image or audio inputs without normalization would reveal how much of the difficulty current perception limits still hide.
- Models with stronger long-context retrieval mechanisms should be tested first on the audio subset to isolate whether context length is the dominant bottleneck.
Load-bearing premise
Converting images and audio into text representations fully isolates structured-output capability without losing information or introducing bias from the conversion step.
What would settle it
If a model's value accuracy on the audio portion stayed far below its accuracy on the text portion even with context lengths matched, the reported pattern that longer context drives the difficulty would be contradicted.
Original abstract
Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices, medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Structured Output Benchmark (SOB), a multi-source evaluation framework for LLM structured output generation. It covers 5,000 text records derived from multi-hop QA, 209 image records from OCR-processed PDFs spanning seven document types (including dense tables and small-print content), and 115 audio records from the AMI corpus transcripts. All models receive text-normalized inputs to isolate structured-output quality from modality-specific processing. Evaluation of 21 models shows near-perfect schema compliance across modalities, but value accuracy (exact leaf-value match to ground truth) reaches at most 83.0% on text, 67.2% on images, and 23.7% on audio, with longer contexts increasing difficulty. The dataset, evaluation pipeline, and code are released publicly.
Significance. If the central results hold after addressing input-fidelity concerns, SOB fills a clear gap by moving beyond schema-compliance-only benchmarks to quantify value accuracy across modalities in a controlled setting. The consistent pattern of high compliance but modality-dependent accuracy drops, combined with the public release of resources, would make it a useful reference for tracking progress in extraction tasks such as invoice parsing or meeting summarization. The design of providing normalized text to all models supports fair cross-modal comparison, provided the normalization quality is quantified.
major comments (2)
- [Abstract / benchmark construction] The central design claim that text-normalized representations (OCR output for the 209 image records; transcripts for the 115 audio records) 'isolate structured-output capability from raw vision or speech-processing quality' is load-bearing for attributing the observed value-accuracy gaps (67.2% images, 23.7% audio) to structured-output limitations rather than input degradation. No OCR word-error rate, transcription fidelity metric, or audit of ground-truth recoverability from the normalized text is reported, even though ground-truth JSON values are verified against the original source context. This omission prevents unambiguous attribution, especially for challenging cases such as dense tables, small-print text, mathematical typesetting in images, and multi-speaker audio.
- [Evaluation and results] Value Accuracy is defined via exact leaf-value match, yet the manuscript provides limited detail on handling of metric edge cases (e.g., numeric formatting, date normalization, or partial matches in complex nested JSON). This affects interpretation of the headline numbers (83.0% text, 67.2% images, 23.7% audio) and the claim that longer context makes extraction substantially harder.
minor comments (2)
- [Abstract] The description of the seven metrics could be expanded with one-sentence definitions to improve immediate clarity for readers.
- [Dataset construction] The paper would benefit from an explicit statement of how many of the 5,000 text records overlap with the full 25,091-record corpus and whether any filtering was applied to ensure ground-truth verifiability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and indicate the revisions made.
Point-by-point responses
- Referee: [Abstract / benchmark construction] The central design claim that text-normalized representations (OCR output for the 209 image records; transcripts for the 115 audio records) 'isolate structured-output capability from raw vision or speech-processing quality' is load-bearing for attributing the observed value-accuracy gaps (67.2% images, 23.7% audio) to structured-output limitations rather than input degradation. No OCR word-error rate, transcription fidelity metric, or audit of ground-truth recoverability from the normalized text is reported, even though ground-truth JSON values are verified against the original source context. This omission prevents unambiguous attribution, especially for challenging cases such as dense tables, small-print text, mathematical typesetting in images, and multi-speaker audio.
Authors: We agree that quantifying the fidelity of the text normalization is necessary to fully support our isolation claim. In the revised manuscript we have added a dedicated paragraph in the benchmark construction section that reports OCR word error rates computed on a sample of the image documents against manually verified text, references the known transcription quality metrics for the AMI corpus, and includes a recoverability audit showing that ground-truth JSON values remain extractable from the normalized text in the large majority of cases. We also explicitly discuss the handling of the challenging document types noted by the referee. revision: yes
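As an illustration of what such a recoverability audit might look like, here is a minimal Python sketch; it assumes the audit reduces to checking that every ground-truth leaf value appears verbatim in the normalized context, which is our reading rather than the authors' stated procedure:

```python
def leaves(value):
    """Yield every leaf value of a nested JSON-like structure."""
    if isinstance(value, dict):
        for v in value.values():
            yield from leaves(v)
    elif isinstance(value, list):
        for v in value:
            yield from leaves(v)
    else:
        yield value

def recoverable(record):
    """True if every ground-truth leaf appears verbatim in the normalized context."""
    context = record["context"].lower()
    return all(str(leaf).lower() in context for leaf in leaves(record["answer"]))

# Toy record; real SOB field names may differ.
example = {
    "context": "invoice total: eur 1243.50, due 2025-01-15",
    "answer": {"total": "1243.50", "currency": "EUR"},
}
print(recoverable(example))  # True
```

A record that fails this check points to normalization loss rather than a model error, which is exactly the attribution the referee asks for.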
- Referee: [Evaluation and results] Value Accuracy is defined via exact leaf-value match, yet the manuscript provides limited detail on handling of metric edge cases (e.g., numeric formatting, date normalization, or partial matches in complex nested JSON). This affects interpretation of the headline numbers (83.0% text, 67.2% images, 23.7% audio) and the claim that longer context makes extraction substantially harder.
Authors: We accept that additional implementation details are required for reproducibility. The revised manuscript expands the evaluation metrics section to describe our exact procedures: numeric values are compared after stripping formatting characters and standardizing representation; dates are converted to ISO 8601 before comparison; only exact leaf-value matches are counted as correct with no partial credit; and the metric is applied recursively to all leaves in nested JSON. We also add a short analysis confirming the context-length effect under these rules. revision: yes
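A hedged sketch of the comparison rules described above (formatting characters stripped from numerics, dates converted to ISO 8601, exact match applied recursively to every leaf, no partial credit); the released pipeline may implement them differently:

```python
from datetime import datetime

def normalize(leaf):
    """Canonicalize a leaf before exact comparison (illustrative rules only)."""
    s = str(leaf).strip()
    # Numerics: strip common formatting characters, then standardize representation.
    try:
        return format(float(s.replace(",", "").replace("$", "")), "g")
    except ValueError:
        pass
    # Dates: convert a few common formats to ISO 8601.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            continue
    return s.lower()

def value_accuracy(pred, gold):
    """Fraction of gold leaves whose predicted counterpart matches exactly."""
    if isinstance(gold, dict):
        scores = [value_accuracy(pred.get(k) if isinstance(pred, dict) else None, v)
                  for k, v in gold.items()]
        return sum(scores) / len(scores) if scores else 1.0
    if isinstance(gold, list):
        preds = pred if isinstance(pred, list) else []
        scores = [value_accuracy(p, g) for p, g in zip(preds, gold)]
        return sum(scores) / len(gold) if gold else 1.0
    return float(pred is not None and normalize(pred) == normalize(gold))

print(value_accuracy({"total": "1,243.50", "date": "15/01/2025"},
                     {"total": 1243.5, "date": "2025-01-15"}))  # 1.0
```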
Circularity Check
No circularity: empirical benchmark with independent evaluation data
full rationale
This paper introduces a new multi-source benchmark (SOB) with 5,000 text records, 209 image records, and 115 audio records, evaluates 21 models on schema compliance and value accuracy, and reports direct empirical results. No derivations, equations, fitted parameters, or self-referential reductions are present; the central claims rest on new data collection, OCR/transcript normalization as an explicit design choice, and ground-truth verification against source context. The evaluation pipeline and dataset are released for external reproduction, so the results are externally checkable rather than internally forced by construction or self-citation chains.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Ground-truth answers have been accurately verified against the original source context for all 5,000+ records.
- domain assumption: Text normalization of images and audio preserves all information required for structured extraction.
Forward citations
Cited by 1 Pith paper
- Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation. Supplies 190k authentic Reddit dialogues grounded in Deezer music entities for scalable conversational music recommendation research.