{"total":18,"items":[{"citing_arxiv_id":"2606.31808","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Databases Need Small, Open-Weight Language Models","primary_cat":"cs.AI","submitted_at":"2026-06-30T15:25:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Quantized open-weight LMs on consumer hardware match closed-source API accuracy for LM-enhanced relational operators while delivering 390x lower cost and 3.8x lower latency in the BlendSQL framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09395","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Empirical Study for Structured Output Control in LLMs for Software Engineering","primary_cat":"cs.SE","submitted_at":"2026-06-08T12:13:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical benchmarks on four SE tasks show grammar-constrained decoding and TTMG eliminate most syntax errors in LLM outputs while structural and semantic errors persist and cascade in downstream tools.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01926","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Bias in Locally Constrained Decoding via Tractable Proposals","primary_cat":"cs.CL","submitted_at":"2026-06-01T08:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces (P-)GCD proposals via tensorized automata for SMC sampling that converge faster to target distributions than LCD baselines on function calling, keyword, and SQL 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26731","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers","primary_cat":"cs.AI","submitted_at":"2026-05-26T09:08:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 432-run experiment across capability tiers refutes the assumption of a monotone inverse relationship between LLM capability and optimal harness complexity, showing model-type-specific patterns instead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14113","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows","primary_cat":"cs.CV","submitted_at":"2026-05-13T20:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ProtoMedAgent formalizes multimodal clinical reporting as iterative zero-gradient test-time optimization over a neuro-symbolic bottleneck with k-anonymity and ℓ-diversity privacy gate, reporting 91.2% faithfulness versus 46.2% for standard RAG on a 4,160-patient cohort.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14051","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks","primary_cat":"cs.AI","submitted_at":"2026-05-13T19:12:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPIN enforces DAG-valid plans and prefix-based stopping for LLM agents, 
cutting executed tasks from 1061 to 623 and tool calls from 11.81 to 6.82 per run on AssetOpsBench while raising success from 0.638 to 0.706.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23955","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems","primary_cat":"cs.AI","submitted_at":"2026-05-11T17:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Financial AI systems using tabular models, graph networks, and LLM agents exhibit nondeterminism that undermines reproducibility, quantified via experiments on public datasets and addressed by a proposed layered evaluation framework linking metrics to audit readiness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08737","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:48:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026. [12] Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori. 
Jsonschemabench: A rigorous benchmark of structured outputs for language models. arXiv preprint arXiv:2501.10868, 2025. [13] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5h0qf7IBZZ. [14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503."},{"citing_arxiv_id":"2605.16342","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T01:02:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06068","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?","primary_cat":"cs.AI","submitted_at":"2026-05-07T11:54:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In other words, generic abstractions impose a portability tax on non-standard models, hardware, and 
applications [12, 49]. Building bespoke systems can solve this problem. For example, knowing workload characteristics at design time can enable optimizations that a workload-agnostic runtime cannot safely assume. RAG-like applications with long shared prefixes can amortize prefill through prompt caching [16, 30], while aggressive speculative decoding based on predicted outputs is possible for some applications like code editing [54, 13, 74, 66]. Similarly, tailoring for a particular model architecture can expose state and execution patterns that fall outside standard decoder-only assumptions. As an example, hybrid state-space/attention models require cache-management strategies different from those used for"},{"citing_arxiv_id":"2605.02363","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-04T09:07:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00060","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data","primary_cat":"cs.AI","submitted_at":"2026-04-30T03:19:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool 
design mattering more than model scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25359","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-28T08:27:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"leaf-path maps of ground truth and prediction, with G=paths(G), P=paths(P), and O=G∩P. We report seven metrics (Table 1), all defined below with key equations. Full mathematical definitions for all seven are in Appendix C. Table 1: Summary of the seven evaluation metrics. All are per-record, then averaged across the set. Metric | Range | What it measures; JSON Pass Rate | {0,1} | Parse + structured root + schema validates; Faithfulness | [0,1] | Soft value match (token F1); Path Recall | [0,1] | Structural completeness; Structure Coverage | [0,1] | Structural precision×recall (F1); Type Safety | [0,1] | JSON type correctness; Perfect Response | {0,1} | Exact full-object match; Value Accuracy | [0,1] | Exact leaf-value match (primary). Value Accuracy is the primary metric. 
It measures the fraction of ground-truth leaf paths where the"},{"citing_arxiv_id":"2604.20811","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diagnosing CFG Interpretation in LLMs","primary_cat":"cs.AI","submitted_at":"2026-04-22T17:43:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.27905","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control","primary_cat":"cs.LG","submitted_at":"2026-03-29T23:28:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ATLAS-RTC raises first-attempt success on structured LLM generation and tool calling by 20-37.8 points through closed-loop token-level interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.19500","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Teaching an Agent to Sketch One Part at a Time","primary_cat":"cs.AI","submitted_at":"2026-03-19T22:08:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part 
annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.15118","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents","primary_cat":"cs.CV","submitted_at":"2026-03-16T11:15:56+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15189","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation","primary_cat":"cs.IR","submitted_at":"2026-02-16T20:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ScrapeGraphAI-100k releases 93,695 real telemetry examples pairing web page content with prompts, schemas, and LLM responses to support training and benchmarking of schema-constrained generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}