Introduces LCS-Bench, a theory-scale benchmark covering 327 textbook items and 4,076 Lean declarations, with evaluations showing state-of-the-art models reach only 20.1% on auto-formalization tasks.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Presents a distributionally robust optimization method for sound probabilistic verification of Datalog policies in AI agents that bounds violation risk regardless of predicate correlations.
VFR-LLM combines small LLMs with symbolic verification and solving to reach 0.983 and 0.933 accuracy on precedence and logical deduction tasks using one model call versus lower results from self-consistency baselines.
citing papers explorer
-
Theory-Scale Auto-Formalization of Logics for Computer Science
Introduces LCS-Bench, a theory-scale benchmark covering 327 textbook items and 4,076 Lean declarations, with evaluations showing state-of-the-art models reach only 20.1% on auto-formalization tasks.
-
Efficient and Sound Probabilistic Verification for AI Agents
Presents a distributionally robust optimization method for sound probabilistic verification of Datalog policies in AI agents that bounds violation risk regardless of predicate correlations.
-
Resource-Aware Neuro-Symbolic Reasoning for Local Small Language Models
VFR-LLM combines small LLMs with symbolic verification and solving to reach 0.983 and 0.933 accuracy on precedence and logical deduction tasks using one model call versus lower results from self-consistency baselines.