pith. sign in

hub

Jsonschemabench: A rigorous benchmark of structured outputs for language models

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it

hub tools

citation-role summary

background 2 dataset 1

citation-polarity summary

years

2026 19

clear filters

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

Diagnosing CFG Interpretation in LLMs

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

Teaching an Agent to Sketch One Part at a Time

cs.AI · 2026-03-19 · unverdicted · novelty 6.0

A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.

Source-Grounded Data Generation for Text-to-JSON Learning

cs.CL · 2026-06-18 · unverdicted · novelty 5.0

STAGE generates source-grounded text-to-JSON training data via spreadsheet validation, raising Qwen3-4B exact match from 31.37% to 74.27% on the 851-example STAGE-Eval benchmark.

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

cs.AI · 2026-05-11 · unverdicted · novelty 5.0

Financial AI systems using tabular models, graph networks, and LLM agents exhibit nondeterminism that undermines reproducibility, quantified via experiments on public datasets and addressed by a proposed layered evaluation framework linking metrics to audit readiness.

Large Databases Need Small, Open-Weight Language Models

cs.AI · 2026-06-30 · unverdicted · novelty 4.0

Quantized open-weight LMs on consumer hardware match closed-source API accuracy for LM-enhanced relational operators while delivering 390x lower cost and 3.8x lower latency in the BlendSQL framework.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? cs.AI · 2026-05-07 · unverdicted · none · ref 16

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

  • TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data cs.AI · 2026-04-30 · unverdicted · none · ref 40

    TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.

  • Diagnosing CFG Interpretation in LLMs cs.AI · 2026-04-22 · unverdicted · none · ref 29

    LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

  • Teaching an Agent to Sketch One Part at a Time cs.AI · 2026-03-19 · unverdicted · none · ref 12

    A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.

  • It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers cs.AI · 2026-05-26 · unverdicted · none · ref 2

    A 432-run experiment across capability tiers refutes the assumption of a monotone inverse relationship between LLM capability and optimal harness complexity, showing model-type-specific patterns instead.

  • SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks cs.AI · 2026-05-13 · conditional · none · ref 5

    SPIN enforces DAG-valid plans and prefix-based stopping for LLM agents, cutting executed tasks from 1061 to 623 and tool calls from 11.81 to 6.82 per run on AssetOpsBench while raising success from 0.638 to 0.706.

  • From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems cs.AI · 2026-05-11 · unverdicted · none · ref 42

    Financial AI systems using tabular models, graph networks, and LLM agents exhibit nondeterminism that undermines reproducibility, quantified via experiments on public datasets and addressed by a proposed layered evaluation framework linking metrics to audit readiness.

  • Large Databases Need Small, Open-Weight Language Models cs.AI · 2026-06-30 · unverdicted · none · ref 14

    Quantized open-weight LMs on consumer hardware match closed-source API accuracy for LM-enhanced relational operators while delivering 390x lower cost and 3.8x lower latency in the BlendSQL framework.