Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
arXiv preprint arXiv:2410.21779 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Formal structure boosts LLM accuracy on legal entailment but does not produce faithful reasoning, with scope laundering and other failures persisting across models on ContractNLI.
citing papers explorer
-
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
-
Know Your Limits : On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning
Formal structure boosts LLM accuracy on legal entailment but does not produce faithful reasoning, with scope laundering and other failures persisting across models on ContractNLI.