Large language model reasoning failures. arXiv preprint arXiv:2602.06176
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Interactive Evaluation Requires a Design Science
  Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
- Using Large Language Models in Physics Education
  Frontier LLMs from late 2025 reach near-perfect scores on text-based physics problem solving and show improved human-grading alignment, yet still struggle to assign partial credit for flawed reasoning.