Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24858–24866
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
FormalScience provides a scalable human-in-the-loop system for autoformalising scientific reasoning into Lean, demonstrated on a new 200-problem physics dataset with perfect formal validity.
citing papers explorer
- Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
- FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean
FormalScience provides a scalable human-in-the-loop system for autoformalising scientific reasoning into Lean, demonstrated on a new 200-problem physics dataset with perfect formal validity.