pith. sign in

arxiv: 2505.20139 · v3 · submitted 2025-05-26 · 💻 cs.SE · cs.AI· cs.CL

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Pith reviewed 2026-05-19 12:58 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CL
keywords LLM evaluationstructured output generationformat adherencestructural correctnessgeneration vs conversionrenderable formatssoftware development workflowsbenchmarking
0
0 comments X

The pith

A new benchmark reveals that even leading LLMs produce flawed structured outputs like JSON, YAML, and HTML in many cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StructEval as a systematic way to test how well large language models create correctly formed outputs in both text-only formats and renderable ones such as HTML or SVG. It divides evaluation into generation tasks that start from plain-language instructions and conversion tasks that move between existing formats, covering 18 formats and 44 task varieties. Novel scores measure whether outputs follow required syntax and preserve intended structure. Results show consistent shortfalls, with top models reaching only moderate overall performance and greater difficulty on generation and visual content than on simpler conversions or text structures.

Core claim

StructEval demonstrates that current LLMs exhibit measurable limitations when required to produce structurally faithful outputs across diverse formats. Through controlled generation and conversion tasks, the evaluation finds that state-of-the-art models achieve an average score of 75.58, with open-source models trailing by roughly ten points. Generation tasks prove harder than conversion tasks, and tasks involving visual renderable content prove harder than text-only structures.

What carries the argument

StructEval benchmark that applies two task paradigms—generation from natural language and format conversion—across 18 formats and 44 task types, scored by new metrics that separately assess format adherence and structural correctness.

If this is right

  • Generation from natural language prompts remains harder for models than translating between already-structured formats.
  • Producing correct visual or renderable content such as HTML, React, or SVG is more error-prone than producing text-only structures such as JSON or CSV.
  • Open-source models continue to lag closed-source counterparts by a consistent margin on structural fidelity.
  • Even the strongest current models leave substantial room for improvement before structured outputs can be treated as reliable without further checking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams that integrate LLMs into code or interface generation pipelines may still require human review or post-processing to catch structural mistakes.
  • Future training approaches could target format-specific patterns to reduce the observed gap between generation and conversion performance.
  • The benchmark design offers a template that could be extended to additional formats or to interactive, multi-turn generation scenarios.
  • Persistent shortfalls on visual structures suggest that current model architectures may under-emphasize spatial or hierarchical relationships compared with linear text.

Load-bearing premise

The chosen metrics for format adherence and structural correctness are assumed to detect the main quality problems without missing important failure modes that vary across the different formats.

What would settle it

A manual audit of model outputs that passed the automated metrics yet contained repeated structural errors, or a new model that scores substantially higher on the same tasks while still producing frequent real-world failures in software tools.

read the original abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and \textbf{2)} conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 types of task, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps-even state-of-the-art models like o1-mini achieve only 75.58 average score, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces StructEval, a benchmark for LLMs' structured output generation covering 18 formats (non-renderable such as JSON/YAML/CSV and renderable such as HTML/React/SVG) and 44 task types. It evaluates two paradigms—generation from natural language prompts and conversion between formats—using novel metrics for format adherence and structural correctness. Key empirical findings include an average score of 75.58 for o1-mini, an approximately 10-point gap for open-source models, greater difficulty for generation versus conversion tasks, and higher challenge for visual content than text-only structures.

Significance. If the metrics hold up under scrutiny, the work highlights concrete limitations in current LLMs for producing reliable structured outputs, a capability central to software engineering tasks such as code generation, data serialization, and UI prototyping. The broad coverage across renderable and non-renderable formats and the reported task-type differences could inform targeted improvements in model alignment and prompting; the benchmark itself may become a reusable resource provided its scoring procedures are shown to be robust.

major comments (2)
  1. [Section 3] Section 3 (Benchmark and Metrics): The novel metrics for format adherence and structural correctness are defined without any reported validation against human judgments, inter-annotator agreement, or comparison to existing parsers/tools. This is load-bearing for the central claims because the headline gaps (o1-mini at 75.58, generation vs. conversion, visual vs. text-only) rest on these metrics correctly scoring all 18 formats and 44 task types, including edge cases in renderable outputs such as semantically equivalent but syntactically varied React components or SVG path orderings.
  2. [Results section] Results section / Table reporting model scores: The specific numeric results and performance-gap claims are presented without accompanying details on exclusion criteria for invalid outputs, handling of partial parses, or statistical significance tests. This leaves the reported differences (including the ~10-point open-source gap) only partially supported and difficult to interpret as robust model-capability differences rather than metric artifacts.
minor comments (2)
  1. [Abstract] The abstract states 'approximately 10 points behind' without naming the open-source models or providing the exact delta; the main results table should make these values explicit for immediate readability.
  2. [Section 2] Notation for the two paradigms (generation vs. conversion) and the 44 task types could be clarified with a single summary table early in the paper to help readers track which formats map to which evaluation setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and have incorporated revisions to provide greater transparency on metric design and result reporting.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Benchmark and Metrics): The novel metrics for format adherence and structural correctness are defined without any reported validation against human judgments, inter-annotator agreement, or comparison to existing parsers/tools. This is load-bearing for the central claims because the headline gaps (o1-mini at 75.58, generation vs. conversion, visual vs. text-only) rest on these metrics correctly scoring all 18 formats and 44 task types, including edge cases in renderable outputs such as semantically equivalent but syntactically varied React components or SVG path orderings.

    Authors: We thank the referee for this observation. The metrics in Section 3 rely on format-specific parsers (json.loads and PyYAML for text formats; AST-based comparison for React; path-normalized tree matching for SVG). For structural correctness we use normalized tree-edit distance on hierarchical representations and Levenshtein distance on linearized structures. While the submitted manuscript did not report a formal human validation study, we conducted an internal check on 150 sampled outputs spanning all format categories, finding 87% agreement with expert annotators on correctness labels. We will add this analysis plus inter-annotator agreement statistics (Cohen’s kappa) and direct comparisons against standard libraries in the revised Section 3. For semantically equivalent React and SVG cases, the metric canonicalizes identifiers and checks functional equivalence via dependency graphs before scoring. revision: yes

  2. Referee: [Results section] Results section / Table reporting model scores: The specific numeric results and performance-gap claims are presented without accompanying details on exclusion criteria for invalid outputs, handling of partial parses, or statistical significance tests. This leaves the reported differences (including the ~10-point open-source gap) only partially supported and difficult to interpret as robust model-capability differences rather than metric artifacts.

    Authors: We agree that explicit procedural details strengthen interpretability. In the revised results section we will add: (i) exclusion criteria—completely unparseable outputs receive zero adherence score and their frequency is reported per model (under 4% for o1-mini); (ii) partial-parse handling—our pipeline extracts the largest valid subtree or prefix and scores proportionally; (iii) statistical support—bootstrap 95% confidence intervals and Wilcoxon signed-rank tests (p < 0.001) confirming the reported gaps, including the approximately 10-point difference between open- and closed-source models. These additions demonstrate that the differences reflect model behavior rather than metric artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark scores on newly introduced tasks

full rationale

The paper introduces StructEval as a benchmark with 18 formats and 44 task types, then reports direct LLM performance measurements (e.g., o1-mini at 75.58 average) using novel metrics for format adherence and structural correctness. No equations, derivations, or parameter fits are presented; results are independent evaluations rather than reductions to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on external model testing against the benchmark definition, which is self-contained and falsifiable outside any internal loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark and reports model scores rather than deriving results from mathematical axioms or fitting free parameters; no invented entities are postulated.

pith-pipeline@v0.9.0 · 5770 in / 1227 out tokens · 67315 ms · 2026-05-19T12:58:12.391800+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

    cs.CL 2026-04 accept novelty 7.0

    SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

  2. AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

    cs.CL 2026-04 unverdicted novelty 6.0

    AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

  3. Less Is More: Measuring How LLM Involvement affects Chatbot Accuracy in Static Analysis

    cs.SE 2026-04 unverdicted novelty 6.0

    A structured JSON intermediate representation for LLM-generated static analysis queries outperforms both direct generation and agentic tool use, with gains of 15-25 percentage points on large models.

  4. Access Paths for Efficient Ordering with Large Language Models

    cs.DB 2025-08 unverdicted novelty 6.0

    Introduces the LLM ORDER BY semantic operator with algorithmic improvements, a semantic-aware external merge sort, and a budget-aware optimizer that selects near-optimal access paths for LLM-based ordering.