Title resolution pending
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
Proposes a claim-calibrated evidence ladder and evaluation protocol with explicit-memory baselines to assess whether TTT produces deployment-usable behavioral memory rather than just proxy metric gains.