AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.
citing papers explorer
-
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
-
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
-
MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs
MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.