pith. machine review for the scientific record.

arxiv: 2604.25359 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI

Recognition: unknown

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

Abhinav Kumar Singh, Harsha Vardhan Khurdula, Vineet Agarwal, Yoeven D Khemlani

Pith reviewed 2026-05-07 16:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords structured output · benchmark · large language models · schema compliance · value accuracy · multi-modal extraction · JSON schema

The pith

Large language models follow JSON schemas almost perfectly yet extract correct leaf values only 83 percent of the time from text, 67 percent from images, and 24 percent from audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new benchmark that tests large language models on producing structured JSON outputs from text documents, OCR-processed images, and audio transcripts. To isolate the structured-output skill itself, every input is first converted into a plain text representation before the model sees it. Across 21 models the results show near-perfect adherence to the requested schema but far lower success at matching the actual ground-truth values inside those fields, with accuracy falling sharply when the source context grows longer. The benchmark supplies thousands of verified question-schema-answer triples drawn from real documents and conversations to support consistent measurement across modalities.

Core claim

Models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0 percent on text, 67.2 percent on images, and 23.7 percent on audio, where longer context makes extraction substantially harder.

What carries the argument

The Structured Output Benchmark (SOB), a collection of 5,000 text, 209 image, and 115 audio records, each pairing a natural-language question, a JSON schema, and a verified ground-truth answer, with all modalities supplied as text-normalized context to isolate structured-output performance.
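Read concretely, a record of this shape might look like the sketch below. Pydantic appears here only because the paper's data pipeline (Figure 4) mentions Pydantic validation; the field names, the model definition, and the example values are illustrative assumptions, not the released format.

```python
# Illustrative SOB-style record layout; the released dataset's fields may differ.
from typing import Any, Literal

from pydantic import BaseModel


class SOBRecord(BaseModel):
    record_id: str
    modality: Literal["text", "image", "audio"]    # source domain
    context: str                   # text-normalized context (native text, OCR output, or transcript)
    question: str                  # natural-language question the model must answer
    answer_schema: dict[str, Any]  # JSON schema the model's output must follow
    ground_truth: dict[str, Any]   # verified answer conforming to answer_schema


# Hypothetical example record, not drawn from the benchmark.
example = SOBRecord(
    record_id="text-00001",
    modality="text",
    context="Acme Corp was founded in 1987 in Austin, Texas, by ...",
    question="Where and in which year was Acme Corp founded?",
    answer_schema={
        "type": "object",
        "properties": {
            "founded_year": {"type": "integer"},
            "city": {"type": "string"},
        },
        "required": ["founded_year", "city"],
    },
    ground_truth={"founded_year": 1987, "city": "Austin"},
)
```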

If this is right

  • Schema compliance alone is not a reliable proxy for end-to-end accuracy in structured data extraction tasks.
  • Value extraction performance degrades with longer contexts, so context-handling improvements would directly raise accuracy on complex documents.
  • Real-world uses such as invoice parsing or medical record structuring require additional verification steps beyond current model outputs.
  • Benchmarks limited to single modalities or to schema checks alone will understate the gap between model behavior and application requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Further gains may come from better normalization pipelines that preserve layout or speaker cues before the model processes the input.
  • Hybrid pipelines that combine language models with traditional parsing tools could close the remaining value-accuracy gap for production use.
  • Extending the benchmark to raw image or audio inputs, without normalization, would reveal how much of the gap current perception limits still hide.
  • Models with stronger long-context retrieval mechanisms should be tested first on the audio subset to isolate whether context length is the dominant bottleneck.

Load-bearing premise

Converting images and audio into text representations fully isolates structured-output capability without losing information or introducing bias from the conversion step.

What would settle it

If a model achieves substantially higher value accuracy on the audio portion than on the text portion while context lengths are matched, the reported pattern that longer context drives the difficulty would be contradicted.

Figures

Figures reproduced from arXiv: 2604.25359 by Abhinav Kumar Singh, Harsha Vardhan Khurdula, Vineet Agarwal, Yoeven D Khemlani.

Figure 1. SOB evaluation pipeline. Each source record (context, question, JSON schema) is submitted to the candidate model. The response is first checked for parse validity, then schema compliance; failures trigger the hardening rule (semantic scores zeroed). Passing responses are path-flattened to leaf nodes with concrete array indices, then compared field-by-field against ground truth to yield the seven evaluation metrics.

Figure 2. JSON Pass Rate (black) vs Value Accuracy (purple), in percentages, across all 21 evaluated models.

Figure 3. A complete benchmark record. The model receives context, question, and schema, and…

Figure 4. Full SOB data pipeline. Three source-specific loaders (HotpotQA, olmOCR-bench, AMI Corpus) feed into a human-authoring step, followed by Pydantic validation and an LLM cross-check (Gemini 2.5 Flash for per-record review and Gemini 2.5 Pro for quality scoring) before records are accepted. All three source domains follow identical authoring, validation, and assembly stages, differing only in data loading and…
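The scoring flow in Figure 1 (parse check, schema compliance with a hardening rule, path-flattening to leaves, exact leaf comparison) can be read as a short procedure. The sketch below is an editorial reconstruction under assumptions: it uses the jsonschema library for compliance, flattens paths naively, and reports a per-record leaf-accuracy fraction; the released pipeline's exact rules and its other metrics are not reproduced here.

```python
# Minimal sketch of the Figure 1 scoring flow; not the authors' implementation.
import json

from jsonschema import Draft202012Validator


def flatten_leaves(obj, prefix=""):
    """Flatten nested JSON into {path: leaf_value}, with concrete array indices."""
    if isinstance(obj, dict):
        leaves = {}
        for key, val in obj.items():
            leaves.update(flatten_leaves(val, f"{prefix}.{key}" if prefix else key))
        return leaves
    if isinstance(obj, list):
        leaves = {}
        for idx, val in enumerate(obj):
            leaves.update(flatten_leaves(val, f"{prefix}[{idx}]"))
        return leaves
    return {prefix: obj}


def score_response(raw_response: str, schema: dict, ground_truth: dict) -> dict:
    # 1. Parse validity: unparseable output fails outright.
    try:
        prediction = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"parse_ok": False, "schema_ok": False, "value_accuracy": 0.0}

    # 2. Schema compliance; the hardening rule zeroes semantic scores on failure.
    errors = list(Draft202012Validator(schema).iter_errors(prediction))
    if errors:
        return {"parse_ok": True, "schema_ok": False, "value_accuracy": 0.0}

    # 3. Path-flatten both sides and compare leaf-by-leaf (exact match only).
    gold = flatten_leaves(ground_truth)
    pred = flatten_leaves(prediction)
    hits = sum(1 for path, value in gold.items() if pred.get(path) == value)
    return {"parse_ok": True, "schema_ok": True, "value_accuracy": hits / max(len(gold), 1)}
```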
read the original abstract

Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices, medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Structured Output Benchmark (SOB), a multi-source evaluation framework for LLM structured output generation. It covers 5,000 text records derived from multi-hop QA, 209 image records from OCR-processed PDFs spanning seven document types (including dense tables and small-print content), and 115 audio records from AMI corpus transcripts. All models receive text-normalized inputs to isolate structured-output quality from modality-specific processing. Evaluation of 21 models shows near-perfect schema compliance across modalities, but value accuracy (exact leaf-value match to ground truth) reaches at most 83.0% on text, 67.2% on images, and 23.7% on audio, with longer contexts increasing difficulty. The dataset, evaluation pipeline, and code are released publicly.

Significance. If the central results hold after addressing input-fidelity concerns, SOB fills a clear gap by moving beyond schema-compliance-only benchmarks to quantify value accuracy across modalities in a controlled setting. The consistent pattern of high compliance but modality-dependent accuracy drops, combined with the public release of resources, would make it a useful reference for tracking progress in extraction tasks such as invoice parsing or meeting summarization. The design of providing normalized text to all models supports fair cross-modal comparison, provided the normalization quality is quantified.

major comments (2)
  1. [Abstract / benchmark construction] Abstract and benchmark construction section: The central design claim that text-normalized representations (OCR output for the 209 image records; transcripts for the 115 audio records) 'isolate structured-output capability from raw vision or speech-processing quality' is load-bearing for attributing the observed value-accuracy gaps (67.2% images, 23.7% audio) to structured-output limitations rather than input degradation. No OCR word-error rate, transcription fidelity metric, or audit of ground-truth recoverability from the normalized text is reported, even though ground-truth JSON values are verified against the original source context. This omission prevents unambiguous attribution, especially for challenging cases such as dense tables, small-print text, mathematical typesetting in images, and multi-speaker audio.
  2. [Evaluation and results] Evaluation metrics and results: Value Accuracy is defined via exact leaf-value match, yet the manuscript provides limited detail on handling of metric edge cases (e.g., numeric formatting, date normalization, or partial matches in complex nested JSON). This affects interpretation of the headline numbers (83.0% text, 67.2% images, 23.7% audio) and the claim that longer context makes extraction substantially harder.
minor comments (2)
  1. [Abstract] Abstract: The description of the seven metrics could be expanded with one-sentence definitions to improve immediate clarity for readers.
  2. [Dataset construction] The paper would benefit from an explicit statement of how many of the 5,000 text records overlap with the full 25,091-record corpus and whether any filtering was applied to ensure ground-truth verifiability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and indicate the revisions made.

read point-by-point responses
  1. Referee: [Abstract / benchmark construction] Abstract and benchmark construction section: The central design claim that text-normalized representations (OCR output for the 209 image records; transcripts for the 115 audio records) 'isolate structured-output capability from raw vision or speech-processing quality' is load-bearing for attributing the observed value-accuracy gaps (67.2% images, 23.7% audio) to structured-output limitations rather than input degradation. No OCR word-error rate, transcription fidelity metric, or audit of ground-truth recoverability from the normalized text is reported, even though ground-truth JSON values are verified against the original source context. This omission prevents unambiguous attribution, especially for challenging cases such as dense tables, small-print text, mathematical typesetting in images, and multi-speaker audio.

    Authors: We agree that quantifying the fidelity of the text normalization is necessary to fully support our isolation claim. In the revised manuscript we have added a dedicated paragraph in the benchmark construction section that reports OCR word error rates computed on a sample of the image documents against manually verified text, references the known transcription quality metrics for the AMI corpus, and includes a recoverability audit showing that ground-truth JSON values remain extractable from the normalized text in the large majority of cases. We also explicitly discuss the handling of the challenging document types noted by the referee. revision: yes

  2. Referee: [Evaluation and results] Evaluation metrics and results: Value Accuracy is defined via exact leaf-value match, yet the manuscript provides limited detail on handling of metric edge cases (e.g., numeric formatting, date normalization, or partial matches in complex nested JSON). This affects interpretation of the headline numbers (83.0% text, 67.2% images, 23.7% audio) and the claim that longer context makes extraction substantially harder.

    Authors: We accept that additional implementation details are required for reproducibility. The revised manuscript expands the evaluation metrics section to describe our exact procedures: numeric values are compared after stripping formatting characters and standardizing representation; dates are converted to ISO 8601 before comparison; only exact leaf-value matches are counted as correct with no partial credit; and the metric is applied recursively to all leaves in nested JSON. We also add a short analysis confirming the context-length effect under these rules. revision: yes
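As a concrete reading of the normalization rules listed in this response, the sketch below shows one way numeric stripping, ISO 8601 date conversion, and exact matching without partial credit could be implemented. The dateutil dependency and the specific stripping rules are assumptions for illustration, not the released evaluation code.

```python
# Illustrative leaf-value normalization before exact comparison.
from dateutil import parser as dateparser  # assumption: python-dateutil is available


def normalize_leaf(value):
    """Canonicalize a leaf before comparison; the rules here are illustrative."""
    if isinstance(value, bool) or value is None:
        return value
    if isinstance(value, (int, float)):
        return float(value)  # 3 and 3.0 compare equal
    if isinstance(value, str):
        stripped = value.replace(",", "").replace("$", "").strip()
        # Numeric strings: "1,234.50" -> 1234.5
        try:
            return float(stripped)
        except ValueError:
            pass
        # Date-like strings: "May 7, 2026" -> "2026-05-07"
        try:
            return dateparser.parse(stripped).date().isoformat()
        except (ValueError, OverflowError):
            return value.strip().lower()
    return value


def leaves_match(predicted, gold) -> bool:
    """Exact match after normalization; no partial credit for near-misses."""
    return normalize_leaf(predicted) == normalize_leaf(gold)
```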

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent evaluation data

full rationale

This paper introduces a new multi-source benchmark (SOB) with 5,000 text records, 209 image records, and 115 audio records, evaluates 21 models on schema compliance and value accuracy, and reports direct empirical results. No derivations, equations, fitted parameters, or self-referential reductions exist; the central claims rest on new data collection, OCR/transcript normalization as an explicit design choice, and ground-truth verification against source context. The evaluation pipeline and dataset are released for external reproduction, so the results can be checked against independent runs rather than being forced internally by construction or by self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on standard evaluation assumptions rather than new free parameters or invented entities; ground-truth verification and text normalization are treated as domain practices.

axioms (2)
  • domain assumption Ground-truth answers have been accurately verified against the original source context for all 5,000+ records.
    Value Accuracy metric depends directly on the correctness of these verified answers.
  • domain assumption Text normalization of images and audio preserves all information required for structured extraction.
    This assumption underpins the claim that the benchmark isolates structured-output capability from modality-specific processing.
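A minimal audit of that second assumption could look like the sketch below: a word error rate of the normalized text against a manually verified reference, plus a check that each ground-truth leaf value remains literally findable in the normalized context. Both functions are crude illustrative approximations, not the authors' procedure.

```python
# Illustrative fidelity checks for text-normalized inputs.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def recoverable(leaf_values: list, normalized_context: str) -> float:
    """Fraction of ground-truth leaf values that appear verbatim in the context."""
    ctx = normalized_context.lower()
    found = sum(1 for v in leaf_values if str(v).lower() in ctx)
    return found / max(len(leaf_values), 1)
```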

pith-pipeline@v0.9.0 · 5604 in / 1305 out tokens · 66400 ms · 2026-05-07T16:27:23.090386+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

    cs.IR · 2026-05 · unverdicted · novelty 7.0

    Reddit2Deezer supplies 190k authentic Reddit dialogues grounded in Deezer music entities for scalable conversational music recommendation research.

Reference graph

Works this paper leans on

37 extracted references · 34 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1] S. Geng et al. JSONSchemaBench: A rigorous benchmark of structured outputs for LMs. arXiv:2501.10868, 2025. DOI: https://doi.org/10.48550/arXiv.2501.10868
  2. [2] J. Li et al. StructEval: Benchmarking LLMs' capabilities to generate structural outputs. arXiv:2505.20139, 2025. DOI: https://doi.org/10.48550/arXiv.2505.20139
  3. [3] J. Zhou et al. Instruction-following evaluation for large language models. arXiv:2311.07911, 2023. DOI: https://doi.org/10.48550/arXiv.2311.07911
  4. [4] N. Ferguson, J. Pennington, N. Beghian, A. Mohan, D. Kiela, S. Agrawal, and T. H. Nguyen. ExtractBench: A benchmark and evaluation methodology for complex structured extraction. arXiv:2602.12247, 2026. DOI: https://doi.org/10.48550/arXiv.2602.12247
  5. [5] B. T. Willard and R. Louf. Efficient guided generation for large language models. arXiv:2307.09702, 2023. DOI: https://doi.org/10.48550/arXiv.2307.09702
  6. [6] Y. Dong et al. XGrammar: Flexible and efficient structured generation engine for LLMs. arXiv:2411.15100, 2024. DOI: https://doi.org/10.48550/arXiv.2411.15100
  7. [7] L. Zheng et al. SGLang: Efficient execution of structured language model programs. In NeurIPS, 2024. DOI: https://doi.org/10.52202/079017-2000
  8. [8] K. Park et al. Grammar-aligned decoding. In NeurIPS, 2024.
  9. [9] Z. R. Tam et al. Let me speak freely? A study on format restrictions on LLM performance. In EMNLP, 2024. DOI: https://doi.org/10.18653/v1/2024.emnlp-industry.91
  10. [10] DeepJSONEval: Benchmarking complex nested JSON data mining for LLMs. arXiv:2509.25922, 2025. DOI: https://doi.org/10.48550/arXiv.2509.25922
  11. [11] S. Tenckhoff et al. LLMStructBench: Benchmarking LLM structured data extraction. arXiv:2602.14743, 2026. DOI: https://doi.org/10.48550/arXiv.2602.14743
  12. [12] G. Wang et al. STED and consistency scoring: Evaluating LLM structured output reliability. arXiv:2512.23712, 2025. DOI: https://doi.org/10.48550/arXiv.2512.23712
  13. [13] Y. Chen et al. A unified view of evaluation metrics for structured prediction. In EMNLP, 2023. DOI: https://doi.org/10.18653/v1/2023.emnlp-main.795
  14. [14] Y. Jiang et al. FollowBench: A multi-level fine-grained constraints following benchmark. In ACL, 2024. DOI: https://doi.org/10.18653/v1/2024.acl-long.257
  15. [15] Y. Qin et al. InFoBench: Evaluating instruction following ability in LLMs. In ACL Findings, 2024. DOI: https://doi.org/10.18653/v1/2024.findings-acl.772
  16. [16] V. Pyatkin et al. Generalizing verifiable instruction following. In NeurIPS D&B, 2025. DOI: https://doi.org/10.48550/arXiv.2507.02833
  17. [17] Y. Liu et al. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024. DOI: https://doi.org/10.1007/978-3-031-72658-3_13
  18. [18] X. Yue et al. MMMU: A massive multi-discipline multimodal understanding benchmark. In CVPR, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00913
  19. [19] M. Mathew et al. DocVQA: A dataset for VQA on document images. In WACV, 2021. DOI: https://doi.org/10.1109/WACV48630.2021.00225
  20. [20] A. Masry et al. ChartQA: A benchmark for question answering about charts. In ACL Findings, 2022. DOI: https://doi.org/10.18653/v1/2022.findings-acl.177
  21. [21] M. Mathew et al. InfographicVQA. In WACV, 2022. DOI: https://doi.org/10.1109/WACV51458.2022.00264
  22. [22]
  23. [23] B. Wang et al. AudioBench: A universal benchmark for audio LLMs. In NAACL, 2025. DOI: https://doi.org/10.18653/v1/2025.naacl-long.218
  24. [24] S. Min et al. FActScore: Fine-grained atomic evaluation of factual precision. In EMNLP, 2023. DOI: https://doi.org/10.18653/v1/2023.emnlp-main.741
  25. [25] S. G. Patil et al. Gorilla: LLM connected with massive APIs. In NeurIPS, 2024. DOI: https://doi.org/10.48550/arXiv.2305.15334
  26. [26] F. Yan et al. The Berkeley function calling leaderboard. 2024.
  27. [27] Y. Qin et al. ToolLLM: Facilitating LLMs to master 16000+ real-world APIs. In ICLR, 2024. DOI: https://doi.org/10.48550/arXiv.2307.16789
  28. [28] Z. Yang et al. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018. DOI: https://doi.org/10.18653/v1/D18-1259
  29. [29] L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023. DOI: https://doi.org/10.52202/075280-2020
  30. [30] A. Patel et al. DataDreamer: Synthetic data generation and reproducible LLM workflows. In ACL, 2024. DOI: https://doi.org/10.18653/v1/2024.acl-long.208
  31. [31] A. Radford et al. Robust speech recognition via large-scale weak supervision. In ICML, 2023. DOI: https://doi.org/10.48550/arXiv.2212.04356
  32. [32] L. Huang et al. A survey on hallucination in large language models. ACM TOIS, 2024. DOI: https://doi.org/10.1145/3703155
  33. [33] H. Trivedi et al. MuSiQue: Multihop questions via single-hop question composition. TACL, 2022. DOI: https://doi.org/10.1162/tacl_a_00475
  34. [34] X. Ho et al. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING, 2020. DOI: https://doi.org/10.18653/v1/2020.coling-main.580
  35. [35] J. Carletta et al. The AMI meeting corpus. 2005.
  36. [36] J. Poznanski et al. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv:2502.18443, 2025. DOI: https://doi.org/10.48550/arXiv.2502.18443
  37. [37] W. Kwon et al. Efficient memory management for LLM serving with PagedAttention. In SOSP, 2023. DOI: https://doi.org/10.1145/3600006.3613165