pith. machine review for the scientific record.

Rethinking benchmark and contamination for language models with rephrased samples

12 Pith papers cite this work. Polarity classification is still indexing.

Citing papers by year: 2026 (11), 2024 (1)

representative citing papers

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
