G., Riols, F
2 Pith papers cite this work.
Citing papers
- Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
  LLMs show heterogeneous robustness to five types of chain-of-thought perturbation. MathError causes a 50-60% accuracy loss in small models but robustness improves with scale, UnitConversion remains hard across model sizes, and ExtraSteps causes minimal degradation.
- Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework
  A framework with six dimensions (Correctness, Consistency, Robustness, Logical Coherence, Efficiency, Stability) is applied to seven LLMs on 975 items. It reveals that logical coherence is orthogonal to correctness, and it exposes ranking inversions that accuracy metrics alone do not show.