JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models

2025 · arXiv 2501.10868

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · dataset: 1

citation-polarity summary

background: 2 · use dataset: 1

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

cs.IR · 2026-02-16 · unverdicted · novelty 7.0

ScrapeGraphAI-100k releases 93,695 real telemetry examples pairing web page content with prompts, schemas, and LLM responses to support training and benchmarking of schema-constrained generation.

Mitigating Bias in Locally Constrained Decoding via Tractable Proposals

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Introduces (P-)GCD proposals via tensorized automata for SMC sampling that converge faster to target distributions than LCD baselines on function calling, keyword, and SQL tasks.

When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

cs.CL · 2026-05-04 · conditional · novelty 6.0

AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

Diagnosing CFG Interpretation in LLMs

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control

cs.LG · 2026-03-29 · unverdicted · novelty 6.0

ATLAS-RTC raises first-attempt success on structured LLM generation and tool calling by 20-37.8 points through closed-loop token-level interventions.

Teaching an Agent to Sketch One Part at a Time

cs.AI · 2026-03-19 · unverdicted · novelty 6.0

A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.

Source-Grounded Data Generation for Text-to-JSON Learning

cs.CL · 2026-06-18 · unverdicted · novelty 5.0

STAGE generates source-grounded text-to-JSON training data via spreadsheet validation, raising Qwen3-4B exact match from 31.37% to 74.27% on the 851-example STAGE-Eval benchmark.

Empirical Study for Structured Output Control in LLMs for Software Engineering

cs.SE · 2026-06-08 · conditional · novelty 5.0

Empirical benchmarks on four SE tasks show grammar-constrained decoding and TTMG eliminate most syntax errors in LLM outputs while structural and semantic errors persist and cascade in downstream tools.

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

A 432-run experiment across capability tiers refutes the assumption of a monotone inverse relationship between LLM capability and optimal harness complexity, showing model-type-specific patterns instead.

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

cs.CV · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

ProtoMedAgent formalizes multimodal clinical reporting as iterative zero-gradient test-time optimization over a neuro-symbolic bottleneck with a k-anonymity and ℓ-diversity privacy gate, reporting 91.2% faithfulness versus 46.2% for standard RAG on a 4,160-patient cohort.

SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

cs.AI · 2026-05-13 · conditional · novelty 5.0

SPIN enforces DAG-valid plans and prefix-based stopping for LLM agents, cutting executed tasks from 1061 to 623 and tool calls from 11.81 to 6.82 per run on AssetOpsBench while raising success from 0.638 to 0.706.

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

cs.AI · 2026-05-11 · unverdicted · novelty 5.0

Financial AI systems using tabular models, graph networks, and LLM agents exhibit nondeterminism that undermines reproducibility, quantified via experiments on public datasets and addressed by a proposed layered evaluation framework linking metrics to audit readiness.

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.

Large Databases Need Small, Open-Weight Language Models

cs.AI · 2026-06-30 · unverdicted · novelty 4.0

Quantized open-weight LMs on consumer hardware match closed-source API accuracy for LM-enhanced relational operators while delivering 390x lower cost and 3.8x lower latency in the BlendSQL framework.

citing papers explorer

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · none · ref 16

TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

cs.AI · 2026-04-30 · unverdicted · none · ref 40

Diagnosing CFG Interpretation in LLMs

cs.AI · 2026-04-22 · unverdicted · none · ref 29

Teaching an Agent to Sketch One Part at a Time

cs.AI · 2026-03-19 · unverdicted · none · ref 12

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

cs.AI · 2026-05-26 · unverdicted · none · ref 2

SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

cs.AI · 2026-05-13 · conditional · none · ref 5

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

cs.AI · 2026-05-11 · unverdicted · none · ref 42

Large Databases Need Small, Open-Weight Language Models

cs.AI · 2026-06-30 · unverdicted · none · ref 14
