pith. machine review for the scientific record.

arxiv: 2604.25359 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI

Recognition: unknown

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

Abhinav Kumar Singh, Harsha Vardhan Khurdula, Vineet Agarwal, Yoeven D Khemlani

Pith reviewed 2026-05-07 16:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords structured output · benchmark · large language models · schema compliance · value accuracy · multi-modal extraction · JSON schema

The pith

Large language models follow JSON schemas almost perfectly yet extract correct leaf values only 83 percent of the time from text, 67 percent from images, and 24 percent from audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new benchmark that tests large language models on producing structured JSON outputs from text documents, OCR-processed images, and audio transcripts. To isolate the structured-output skill itself, every input is first converted into a plain text representation before the model sees it. Across 21 models the results show near-perfect adherence to the requested schema but far lower success at matching the actual ground-truth values inside those fields, with accuracy falling sharply when the source context grows longer. The benchmark supplies thousands of verified question-schema-answer triples drawn from real documents and conversations to support consistent measurement across modalities.

Core claim

Models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0 percent on text, 67.2 percent on images, and 23.7 percent on audio, where longer context makes extraction substantially harder.

What carries the argument

The Structured Output Benchmark (SOB), a collection of 5,000 text, 209 image, and 115 audio records, each pairing a natural-language question, a JSON schema, and a verified ground-truth answer, with all modalities supplied as text-normalized context to isolate structured-output performance.
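Read concretely, a record of this shape might look like the sketch below. Pydantic appears here only because the paper's data pipeline (Figure 4) mentions Pydantic validation; the field names, the model definition, and the example values are illustrative assumptions, not the released format.

```python
# Illustrative SOB-style record layout; the released dataset's fields may differ.
from typing import Any, Literal

from pydantic import BaseModel


class SOBRecord(BaseModel):
    record_id: str
    modality: Literal["text", "image", "audio"]    # source domain
    context: str                   # text-normalized context (native text, OCR output, or transcript)
    question: str                  # natural-language question the model must answer
    answer_schema: dict[str, Any]  # JSON schema the model's output must follow
    ground_truth: dict[str, Any]   # verified answer conforming to answer_schema


# Hypothetical example record, not drawn from the benchmark.
example = SOBRecord(
    record_id="text-00001",
    modality="text",
    context="Acme Corp was founded in 1987 in Austin, Texas, by ...",
    question="Where and in which year was Acme Corp founded?",
    answer_schema={
        "type": "object",
        "properties": {
            "founded_year": {"type": "integer"},
            "city": {"type": "string"},
        },
        "required": ["founded_year", "city"],
    },
    ground_truth={"founded_year": 1987, "city": "Austin"},
)
```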

If this is right

  • Schema compliance alone is not a reliable proxy for end-to-end accuracy in structured data extraction tasks.
  • Value extraction performance degrades with longer contexts, so context-handling improvements would directly raise accuracy on complex documents.
  • Real-world uses such as invoice parsing or medical record structuring require additional verification steps beyond current model outputs.
  • Benchmarks limited to single modalities or to schema checks alone will understate the gap between model behavior and application requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Further gains may come from better normalization pipelines that preserve layout or speaker cues before the model processes the input.
  • Hybrid pipelines that combine language models with traditional parsing tools could close the remaining value-accuracy gap for production use.
  • Extending the benchmark to raw image or audio inputs, without normalization, would reveal how much of the gap current perception limits still hide.
  • Models with stronger long-context retrieval mechanisms should be tested first on the audio subset to isolate whether context length is the dominant bottleneck.

Load-bearing premise

Converting images and audio into text representations fully isolates structured-output capability without losing information or introducing bias from the conversion step.

What would settle it

If a model achieves substantially higher value accuracy on the audio portion than on the text portion while context lengths are matched, the reported pattern that longer context drives the difficulty would be contradicted.

Figures

Figures reproduced from arXiv: 2604.25359 by Abhinav Kumar Singh, Harsha Vardhan Khurdula, Vineet Agarwal, Yoeven D Khemlani.

Figure 1. SOB evaluation pipeline. Each source record (context, question, JSON schema) is submitted to the candidate model. The response is first checked for parse validity, then schema compliance; failures trigger the hardening rule (semantic scores zeroed). Passing responses are path-flattened to leaf nodes with concrete array indices, then compared field-by-field against ground truth to yield the seven evaluation metrics.

Figure 2. JSON Pass Rate (black) vs Value Accuracy (purple), in percentages, across all 21 evaluated models.

Figure 3. A complete benchmark record. The model receives context, question, and schema, and…

Figure 4. Full SOB data pipeline. Three source-specific loaders (HotpotQA, olmOCR-bench, AMI Corpus) feed into a human-authoring step, followed by Pydantic validation and an LLM cross-check (Gemini 2.5 Flash for per-record review and Gemini 2.5 Pro for quality scoring) before records are accepted. All three source domains follow identical authoring, validation, and assembly stages, differing only in data loading and…
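The scoring flow in Figure 1 (parse check, schema compliance with a hardening rule, path-flattening to leaves, exact leaf comparison) can be read as a short procedure. The sketch below is an editorial reconstruction under assumptions: it uses the jsonschema library for compliance, flattens paths naively, and reports a per-record leaf-accuracy fraction; the released pipeline's exact rules and its other metrics are not reproduced here.

```python
# Minimal sketch of the Figure 1 scoring flow; not the authors' implementation.
import json

from jsonschema import Draft202012Validator


def flatten_leaves(obj, prefix=""):
    """Flatten nested JSON into {path: leaf_value}, with concrete array indices."""
    if isinstance(obj, dict):
        leaves = {}
        for key, val in obj.items():
            leaves.update(flatten_leaves(val, f"{prefix}.{key}" if prefix else key))
        return leaves
    if isinstance(obj, list):
        leaves = {}
        for idx, val in enumerate(obj):
            leaves.update(flatten_leaves(val, f"{prefix}[{idx}]"))
        return leaves
    return {prefix: obj}


def score_response(raw_response: str, schema: dict, ground_truth: dict) -> dict:
    # 1. Parse validity: unparseable output fails outright.
    try:
        prediction = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"parse_ok": False, "schema_ok": False, "value_accuracy": 0.0}

    # 2. Schema compliance; the hardening rule zeroes semantic scores on failure.
    errors = list(Draft202012Validator(schema).iter_errors(prediction))
    if errors:
        return {"parse_ok": True, "schema_ok": False, "value_accuracy": 0.0}

    # 3. Path-flatten both sides and compare leaf-by-leaf (exact match only).
    gold = flatten_leaves(ground_truth)
    pred = flatten_leaves(prediction)
    hits = sum(1 for path, value in gold.items() if pred.get(path) == value)
    return {"parse_ok": True, "schema_ok": True, "value_accuracy": hits / max(len(gold), 1)}
```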
read the original abstract

Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices, medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Structured Output Benchmark (SOB), a multi-source evaluation framework for LLM structured output generation. It covers 5,000 text records derived from multi-hop QA, 209 image records from OCR-processed PDFs spanning seven document types (including dense tables and small-print content), and 115 audio records from AMI corpus transcripts. All models receive text-normalized inputs to isolate structured-output quality from modality-specific processing. Evaluation of 21 models shows near-perfect schema compliance across modalities, but value accuracy (exact leaf-value match to ground truth) reaches at most 83.0% on text, 67.2% on images, and 23.7% on audio, with longer contexts increasing difficulty. The dataset, evaluation pipeline, and code are released publicly.

Significance. If the central results hold after addressing input-fidelity concerns, SOB fills a clear gap by moving beyond schema-compliance-only benchmarks to quantify value accuracy across modalities in a controlled setting. The consistent pattern of high compliance but modality-dependent accuracy drops, combined with the public release of resources, would make it a useful reference for tracking progress in extraction tasks such as invoice parsing or meeting summarization. The design of providing normalized text to all models supports fair cross-modal comparison, provided the normalization quality is quantified.

major comments (2)
  1. [Abstract / benchmark construction] Abstract and benchmark construction section: The central design claim that text-normalized representations (OCR output for the 209 image records; transcripts for the 115 audio records) 'isolate structured-output capability from raw vision or speech-processing quality' is load-bearing for attributing the observed value-accuracy gaps (67.2% images, 23.7% audio) to structured-output limitations rather than input degradation. No OCR word-error rate, transcription fidelity metric, or audit of ground-truth recoverability from the normalized text is reported, even though ground-truth JSON values are verified against the original source context. This omission prevents unambiguous attribution, especially for challenging cases such as dense tables, small-print text, mathematical typesetting in images, and multi-speaker audio.
  2. [Evaluation and results] Evaluation metrics and results: Value Accuracy is defined via exact leaf-value match, yet the manuscript provides limited detail on handling of metric edge cases (e.g., numeric formatting, date normalization, or partial matches in complex nested JSON). This affects interpretation of the headline numbers (83.0% text, 67.2% images, 23.7% audio) and the claim that longer context makes extraction substantially harder.
minor comments (2)
  1. [Abstract] Abstract: The description of the seven metrics could be expanded with one-sentence definitions to improve immediate clarity for readers.
  2. [Dataset construction] The paper would benefit from an explicit statement of how many of the 5,000 text records overlap with the full 25,091-record corpus and whether any filtering was applied to ensure ground-truth verifiability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and indicate the revisions made.

read point-by-point responses
  1. Referee: [Abstract / benchmark construction] Abstract and benchmark construction section: The central design claim that text-normalized representations (OCR output for the 209 image records; transcripts for the 115 audio records) 'isolate structured-output capability from raw vision or speech-processing quality' is load-bearing for attributing the observed value-accuracy gaps (67.2% images, 23.7% audio) to structured-output limitations rather than input degradation. No OCR word-error rate, transcription fidelity metric, or audit of ground-truth recoverability from the normalized text is reported, even though ground-truth JSON values are verified against the original source context. This omission prevents unambiguous attribution, especially for challenging cases such as dense tables, small-print text, mathematical typesetting in images, and multi-speaker audio.

    Authors: We agree that quantifying the fidelity of the text normalization is necessary to fully support our isolation claim. In the revised manuscript we have added a dedicated paragraph in the benchmark construction section that reports OCR word error rates computed on a sample of the image documents against manually verified text, references the known transcription quality metrics for the AMI corpus, and includes a recoverability audit showing that ground-truth JSON values remain extractable from the normalized text in the large majority of cases. We also explicitly discuss the handling of the challenging document types noted by the referee. revision: yes

  2. Referee: [Evaluation and results] Evaluation metrics and results: Value Accuracy is defined via exact leaf-value match, yet the manuscript provides limited detail on handling of metric edge cases (e.g., numeric formatting, date normalization, or partial matches in complex nested JSON). This affects interpretation of the headline numbers (83.0% text, 67.2% images, 23.7% audio) and the claim that longer context makes extraction substantially harder.

    Authors: We accept that additional implementation details are required for reproducibility. The revised manuscript expands the evaluation metrics section to describe our exact procedures: numeric values are compared after stripping formatting characters and standardizing representation; dates are converted to ISO 8601 before comparison; only exact leaf-value matches are counted as correct with no partial credit; and the metric is applied recursively to all leaves in nested JSON. We also add a short analysis confirming the context-length effect under these rules. revision: yes
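As a concrete reading of the normalization rules listed in this response, the sketch below shows one way numeric stripping, ISO 8601 date conversion, and exact matching without partial credit could be implemented. The dateutil dependency and the specific stripping rules are assumptions for illustration, not the released evaluation code.

```python
# Illustrative leaf-value normalization before exact comparison.
from dateutil import parser as dateparser  # assumption: python-dateutil is available


def normalize_leaf(value):
    """Canonicalize a leaf before comparison; the rules here are illustrative."""
    if isinstance(value, bool) or value is None:
        return value
    if isinstance(value, (int, float)):
        return float(value)  # 3 and 3.0 compare equal
    if isinstance(value, str):
        stripped = value.replace(",", "").replace("$", "").strip()
        # Numeric strings: "1,234.50" -> 1234.5
        try:
            return float(stripped)
        except ValueError:
            pass
        # Date-like strings: "May 7, 2026" -> "2026-05-07"
        try:
            return dateparser.parse(stripped).date().isoformat()
        except (ValueError, OverflowError):
            return value.strip().lower()
    return value


def leaves_match(predicted, gold) -> bool:
    """Exact match after normalization; no partial credit for near-misses."""
    return normalize_leaf(predicted) == normalize_leaf(gold)
```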

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent evaluation data

full rationale

This paper introduces a new multi-source benchmark (SOB) with 5,000 text records, 209 image records, and 115 audio records, evaluates 21 models on schema compliance and value accuracy, and reports direct empirical results. No derivations, equations, fitted parameters, or self-referential reductions exist; the central claims rest on new data collection, OCR/transcript normalization as an explicit design choice, and ground-truth verification against source context. The evaluation pipeline and dataset are released for external reproduction, so the results can be checked against independent runs rather than being forced internally by construction or by self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on standard evaluation assumptions rather than new free parameters or invented entities; ground-truth verification and text normalization are treated as domain practices.

axioms (2)
  • domain assumption Ground-truth answers have been accurately verified against the original source context for all 5,000+ records.
    Value Accuracy metric depends directly on the correctness of these verified answers.
  • domain assumption Text normalization of images and audio preserves all information required for structured extraction.
    This assumption underpins the claim that the benchmark isolates structured-output capability from modality-specific processing.
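A minimal audit of that second assumption could look like the sketch below: a word error rate of the normalized text against a manually verified reference, plus a check that each ground-truth leaf value remains literally findable in the normalized context. Both functions are crude illustrative approximations, not the authors' procedure.

```python
# Illustrative fidelity checks for text-normalized inputs.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def recoverable(leaf_values: list, normalized_context: str) -> float:
    """Fraction of ground-truth leaf values that appear verbatim in the context."""
    ctx = normalized_context.lower()
    found = sum(1 for v in leaf_values if str(v).lower() in ctx)
    return found / max(len(leaf_values), 1)
```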

pith-pipeline@v0.9.0 · 5604 in / 1305 out tokens · 66400 ms · 2026-05-07T16:27:23.090386+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

    cs.IR · 2026-05 · unverdicted · novelty 7.0

    Reddit2Deezer supplies 190k authentic Reddit dialogues grounded in Deezer music entities for scalable conversational music recommendation research.

Reference graph

Works this paper leans on

37 extracted references · 34 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1] S. Geng et al. JSONSchemaBench: A rigorous benchmark of structured outputs for LMs. arXiv:2501.10868, 2025. DOI: https://doi.org/10.48550/arXiv.2501.10868
  2. [2] J. Li et al. StructEval: Benchmarking LLMs' capabilities to generate structural outputs. arXiv:2505.20139, 2025. DOI: https://doi.org/10.48550/arXiv.2505.20139
  3. [3] J. Zhou et al. Instruction-following evaluation for large language models. arXiv:2311.07911, 2023. DOI: https://doi.org/10.48550/arXiv.2311.07911
  4. [4] N. Ferguson, J. Pennington, N. Beghian, A. Mohan, D. Kiela, S. Agrawal, and T. H. Nguyen. ExtractBench: A benchmark and evaluation methodology for complex structured extraction. arXiv:2602.12247, 2026. DOI: https://doi.org/10.48550/arXiv.2602.12247
  5. [5] B. T. Willard and R. Louf. Efficient guided generation for large language models. arXiv:2307.09702, 2023. DOI: https://doi.org/10.48550/arXiv.2307.09702
  6. [6] Y. Dong et al. XGrammar: Flexible and efficient structured generation engine for LLMs. arXiv:2411.15100, 2024. DOI: https://doi.org/10.48550/arXiv.2411.15100
  7. [7] L. Zheng et al. SGLang: Efficient execution of structured language model programs. In NeurIPS, 2024. DOI: https://doi.org/10.52202/079017-2000
  8. [8] K. Park et al. Grammar-aligned decoding. In NeurIPS, 2024.
  9. [9] Z. R. Tam et al. Let me speak freely? A study on format restrictions on LLM performance. In EMNLP, 2024. DOI: https://doi.org/10.18653/v1/2024.emnlp-industry.91
  10. [10] DeepJSONEval: Benchmarking complex nested JSON data mining for LLMs. arXiv:2509.25922, 2025. DOI: https://doi.org/10.48550/arXiv.2509.25922
  11. [11] S. Tenckhoff et al. LLMStructBench: Benchmarking LLM structured data extraction. arXiv:2602.14743, 2026. DOI: https://doi.org/10.48550/arXiv.2602.14743
  12. [12] G. Wang et al. STED and consistency scoring: Evaluating LLM structured output reliability. arXiv:2512.23712, 2025. DOI: https://doi.org/10.48550/arXiv.2512.23712
  13. [13] Y. Chen et al. A unified view of evaluation metrics for structured prediction. In EMNLP, 2023. DOI: https://doi.org/10.18653/v1/2023.emnlp-main.795
  14. [14] Y. Jiang et al. FollowBench: A multi-level fine-grained constraints following benchmark. In ACL, 2024. DOI: https://doi.org/10.18653/v1/2024.acl-long.257
  15. [15] Y. Qin et al. InFoBench: Evaluating instruction following ability in LLMs. In ACL Findings, 2024. DOI: https://doi.org/10.18653/v1/2024.findings-acl.772
  16. [16] V. Pyatkin et al. Generalizing verifiable instruction following. In NeurIPS D&B, 2025. DOI: https://doi.org/10.48550/arXiv.2507.02833
  17. [17] Y. Liu et al. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024. DOI: https://doi.org/10.1007/978-3-031-72658-3_13
  18. [18] X. Yue et al. MMMU: A massive multi-discipline multimodal understanding benchmark. In CVPR, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.00913
  19. [19] M. Mathew et al. DocVQA: A dataset for VQA on document images. In WACV, 2021. DOI: https://doi.org/10.1109/WACV48630.2021.00225
  20. [20] A. Masry et al. ChartQA: A benchmark for question answering about charts. In ACL Findings, 2022. DOI: https://doi.org/10.18653/v1/2022.findings-acl.177
  21. [21] M. Mathew et al. InfographicVQA. In WACV, 2022. DOI: https://doi.org/10.1109/WACV51458.2022.00264
  22. [22]
  23. [23] B. Wang et al. AudioBench: A universal benchmark for audio LLMs. In NAACL, 2025. DOI: https://doi.org/10.18653/v1/2025.naacl-long.218
  24. [24] S. Min et al. FActScore: Fine-grained atomic evaluation of factual precision. In EMNLP, 2023. DOI: https://doi.org/10.18653/v1/2023.emnlp-main.741
  25. [25] S. G. Patil et al. Gorilla: LLM connected with massive APIs. In NeurIPS, 2024. DOI: https://doi.org/10.48550/arXiv.2305.15334
  26. [26] F. Yan et al. The Berkeley function calling leaderboard. 2024.
  27. [27] Y. Qin et al. ToolLLM: Facilitating LLMs to master 16000+ real-world APIs. In ICLR, 2024. DOI: https://doi.org/10.48550/arXiv.2307.16789
  28. [28] Z. Yang et al. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018. DOI: https://doi.org/10.18653/v1/D18-1259
  29. [29] L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023. DOI: https://doi.org/10.52202/075280-2020
  30. [30] A. Patel et al. DataDreamer: Synthetic data generation and reproducible LLM workflows. In ACL, 2024. DOI: https://doi.org/10.18653/v1/2024.acl-long.208
  31. [31] A. Radford et al. Robust speech recognition via large-scale weak supervision. In ICML, 2023. DOI: https://doi.org/10.48550/arXiv.2212.04356
  32. [32] L. Huang et al. A survey on hallucination in large language models. ACM TOIS, 2024. DOI: https://doi.org/10.1145/3703155
  33. [33] H. Trivedi et al. MuSiQue: Multihop questions via single-hop question composition. TACL, 2022. DOI: https://doi.org/10.1162/tacl_a_00475
  34. [34] X. Ho et al. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING, 2020. DOI: https://doi.org/10.18653/v1/2020.coling-main.580
  35. [35] J. Carletta et al. The AMI meeting corpus. 2005.
  36. [36] J. Poznanski et al. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv:2502.18443, 2025. DOI: https://doi.org/10.48550/arXiv.2502.18443
  37. [37] W. Kwon et al. Efficient memory management for LLM serving with PagedAttention. In SOSP, 2023. DOI: https://doi.org/10.1145/3600006.3613165