Jsonschemabench: A rigorous benchmark of structured outputs for language models
14 Pith papers cite this work.
Citing papers by year: 2026 (14)
citing papers explorer
- VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
  VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
- VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
  VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.
- The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
  On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
- TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
  TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
- The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
  SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
- ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation
  ScrapeGraphAI-100k releases 93,695 real telemetry examples pairing web page content with prompts, schemas, and LLM responses to support training and benchmarking of schema-constrained generation.
- When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
  AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
- Diagnosing CFG Interpretation in LLMs
  LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.
- ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control
  ATLAS-RTC raises first-attempt success on structured LLM generation and tool calling by 20-37.8 points through closed-loop token-level interventions.
- Teaching an Agent to Sketch One Part at a Time
  A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.
- It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
  A 432-run experiment across capability tiers refutes the assumption of a monotone inverse relationship between LLM capability and optimal harness complexity, showing model-type-specific patterns instead.
- ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
  ProtoMedAgent formalizes multimodal clinical reporting as iterative zero-gradient test-time optimization over a neuro-symbolic bottleneck with k-anonymity and ℓ-diversity privacy gate, reporting 91.2% faithfulness versus 46.2% for standard RAG on a 4,160-patient cohort.
- SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks
  SPIN enforces DAG-valid plans and prefix-based stopping for LLM agents, cutting executed tasks from 1061 to 623 and tool calls from 11.81 to 6.82 per run on AssetOpsBench while raising success from 0.638 to 0.706.
- DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models
  DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.