LLMs show heterogeneous robustness to five types of chain-of-thought perturbation: MathError causes a 50-60% accuracy loss in small models, though robustness improves with scale; UnitConversion remains hard across model sizes; and ExtraSteps causes minimal degradation.
G., Riols, F
2 Pith papers cite this work.