pith. sign in

arxiv: 2602.17234 · v2 · pith:VNQV46QInew · submitted 2026-02-19 · 💻 cs.AI · cs.LG

All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting

classification 💻 cs.AI cs.LG
keywords textbfbacktestingclaim-levelcountfurtherinterpretableknowledgellms
0
0 comments X
read the original abstract

Backtesting LLMs on resolved events assumes models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff knowledge. We introduce a claim-level evaluation framework that decomposes prediction rationales into atomic claims and applies Shapley values to quantify each claim's decision impact, yielding \textbf{Shapley-DCLR} (\textbf{Shapley}-weighted \textbf{D}ecision-\textbf{C}ritical \textbf{L}eakage \textbf{R}ate) -- an interpretable metric measuring what fraction of decision-driving reasoning is contaminated. We further propose \textbf{TimeSPEC} (\textbf{Time}-\textbf{S}upervised \textbf{P}rediction with \textbf{E}xtracted \textbf{C}laims), an inference-time architecture that interleaves temporally-filtered retrieval with claim-level supervision, producing predictions grounded entirely in pre-cutoff evidence. Across three LLMs, the ablation experiments confirm retrieval and supervision are jointly necessary; and a three-task probe further illstrates that the performance cost of temporal enforcement scales with each task's reliance on post-cutoff information.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.