arxiv: 2604.06997 · v1 · submitted 2026-04-08 · 💻 cs.CL

Recognition: no theorem link

ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

Yihao Wang , Zijian He , Jie Ren , Keze Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords temporal retrievalClassical ChineseSpring and Autumn Annalsdual-encoderretrieval-augmented generationhistorical RAGtime-keyed evaluationregnal dating

0 comments

The pith

A time-aware dual-encoder outperforms semantic baselines when retrieving month-specific records from Classical Chinese annals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether retrieval systems can locate the precise record tied to a given regnal month in terse historical texts where dates are expressed indirectly. Standard meaning-based models often return passages from nearby months that sound relevant but belong to the wrong time. The authors build ChunQiuTR, a benchmark that organizes the Spring and Autumn Annals by month-level reign keys and adds realistic near-miss examples based on temporal proximity. They then introduce a dual-encoder that injects calendar-derived time signals into the representations. Experiments show steady gains over strong semantic baselines, indicating that temporal consistency must be enforced at retrieval time for downstream historical applications to stay accurate.

Core claim

ChunQiuTR supplies month-level reign keys derived from surrounding context together with chrono-near confounders drawn from the Spring and Autumn Annals and its exegetical tradition. The Calendrical Temporal Dual-encoder (CTD) combines Fourier-based absolute calendrical context with relative offset biasing, yielding consistent improvements over semantic dual-encoder baselines under time-keyed evaluation and thereby establishing retrieval-time temporal consistency as a prerequisite for faithful historical RAG.

What carries the argument

The Calendrical Temporal Dual-encoder (CTD), which fuses Fourier encodings of absolute calendar positions with relative date-offset biases inside a dual-encoder to enforce both global temporal structure and local consistency.

If this is right

Semantic similarity alone fails for queries that require an exact regnal-month record because semantically close passages can still be chronologically invalid.
Benchmarks for historical retrieval must include month-level keys and temporal confounders to expose failures hidden by topical-only evaluation.
Models that receive explicit calendrical signals produce higher-fidelity passages for downstream RAG systems that synthesize historical timelines.
The same time-keyed evaluation protocol can be applied to other annals whose dating relies on implicit reign phrases.
Consistent gains under time-keyed metrics indicate that temporal encoding should be a standard component rather than an optional add-on for classical-text retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark construction could be reused to create time-sensitive test sets for legal codes or medical case histories that also carry implicit dating.
Integrating the CTD-style signals into larger language models might lower the rate of timeline hallucinations when those models answer questions drawn from ancient sources.
Applying the same Fourier-plus-offset approach to translated or modernized versions of the annals would test whether the gains survive when the original terse phrasing is expanded.
If the method generalizes, it could be combined with existing cross-lingual retrievers to support multilingual historical RAG without requiring full Gregorian conversion of every date.

Load-bearing premise

The chrono-near negative examples constructed for the benchmark accurately reflect the temporal mistakes that real retrieval systems would make on queries about the annals.

What would settle it

A controlled experiment in which human historians label temporal relevance for the same queries and the time-aware model shows no advantage over semantic baselines on those labels would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.06997 by Jie Ren, Keze Wang, Yihao Wang, Zijian He.

**Figure 2.** Figure 2: Overview of the ChunQiuTR construction. ( [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of our Calendrical Temporal Dual-encoder (CTD). A shared Transformer dual-encoder encodes [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of Qualitative Examples. month. Our CTD models achieve the best performance in both regimes, with especially large gains on multi-month queries for the BERT backbone (roughly +0.16 R@1) and consistent improvements for Qwen3-Embed, indicating better temporal ordering under chrono-near confounds. 5.3.2 Qualitative Examples [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Example of event-level grouping under the [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: LLM-assisted reverse matching from para [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Representative temporal query types in ChunQiuTR, including point queries, past/future windows, around [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Month-level coverage, gap counts, and gap ratios for the normalized [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Top-1 near-miss cases on ChunQiuTR (test set). We show the top–5 results from a baseline (left) and ours [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Online LLM without evidence grounding on a month-level point query from ChunQiuTR. For the query “鲁隐公元年十二月”, the gold month contains two entries (祭伯来 and 公子益师卒). Without evidence, the model either predicts an empty month or returns an incomplete answer. sources, the queries are instantiated from a small set of manually written templates, and LLMs are used only to propose candidate splits or candidate ali… view at source ↗

**Figure 11.** Figure 11: Evidence-bounded RAG for the same query as [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

read the original abstract

Retrieval shapes how language models access and ground knowledge in retrieval-augmented generation (RAG). In historical research, the target is often not an arbitrary relevant passage, but the exact record for a specific regnal month, where temporal consistency matters as much as topical relevance. This is especially challenging for Classical Chinese annals, where time is expressed through terse, implicit, non-Gregorian reign phrases that must be interpreted from surrounding context, so semantically plausible evidence can still be temporally invalid. We introduce \textbf{ChunQiuTR}, a time-keyed retrieval benchmark built from the \textit{Spring and Autumn Annals} and its exegetical tradition. ChunQiuTR organizes records by month-level reign keys and includes chrono-near confounders that mirror realistic retrieval failures. We further propose \textbf{CTD} (Calendrical Temporal Dual-encoder), a time-aware dual-encoder that combines Fourier-based absolute calendrical context with relative offset biasing. Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG. Our code and datasets are available at \href{https://github.com/xbdxwyh/ChunQiuTR}{\texttt{github.com/xbdxwyh/ChunQiuTR}}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ChunQiuTR gives a focused benchmark and simple time-encoding tweak for Classical Chinese retrieval, but the gains rest on unvalidated assumptions about how the test cases were built.

read the letter

The main thing to know is that the paper builds ChunQiuTR, a retrieval benchmark drawn from the Spring and Autumn Annals and its commentaries, organized around month-level reign keys, and pairs it with CTD, a dual-encoder that adds Fourier features for absolute calendrical position plus relative offset terms. They show the model beats plain semantic baselines on time-keyed metrics and release the code and data on GitHub. That combination of a new domain-specific test set and reproducible assets is the concrete contribution. The setup makes sense for a corpus where time is expressed through terse, implicit reign phrases rather than Gregorian dates, and the chrono-near negatives target a real failure mode in historical RAG. The Fourier-plus-offset design is a straightforward way to inject calendrical structure without overhauling the encoder. On the soft side, the central claim that retrieval-time temporal consistency is a prerequisite for faithful historical RAG depends on the confounders actually mirroring the temporal mistakes that occur with genuine queries. If the month-level keys were derived using full surrounding annals context that a downstream query would not have, or if the near-confounders were constructed with explicit signals the model is explicitly trained to exploit, then the observed gains may not generalize. The paper would be stronger with an explicit check, such as expert review of the test cases or an ablation showing performance when those signals are removed. This is for people working on temporal or historical retrieval in non-Western texts rather than general RAG. It is narrow but grounded, and the public assets make it straightforward to inspect. I would send it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces ChunQiuTR, a time-keyed retrieval benchmark derived from the Spring and Autumn Annals and its exegetical tradition, which organizes records by month-level reign keys and incorporates chrono-near confounders designed to simulate realistic temporal retrieval failures in Classical Chinese texts. It proposes the CTD (Calendrical Temporal Dual-encoder) model that augments a dual-encoder with Fourier-based absolute calendrical context and relative offset biasing. Experiments are reported to show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, from which the authors conclude that retrieval-time temporal consistency is a prerequisite for faithful downstream historical RAG. Code and datasets are released publicly.

Significance. If the benchmark construction and evaluation protocol hold, the work would usefully highlight a gap in semantic retrieval for temporally structured historical corpora and provide a reproducible testbed for time-aware encoders in non-Gregorian, context-dependent settings. The public code release and focus on Classical Chinese annals are concrete strengths that could support follow-on work in historical RAG and temporal IR.

major comments (3)

[Section 3] Section 3 (Benchmark Construction): the month-level reign keys are stated to be derived from surrounding context, yet the manuscript provides no explicit check that this derivation uses only information that would be available to a downstream RAG query; if full-annals context is required for key assignment, the reported gains may not transfer to realistic query settings.
[Sections 4 and 5] Section 4 (CTD Model) and Section 5 (Experiments): the central claim that temporal consistency is a 'key prerequisite' rests on the chrono-near confounders accurately mirroring real retrieval failures, but the text supplies no independent validation (expert annotation, historical error analysis, or comparison against attested semantic-temporal mismatches) that the confounders avoid leaking explicit calendrical signals that the Fourier/offset components can exploit but genuine queries lack.
[Section 5] Section 5 (Experiments): the abstract asserts 'consistent gains' over semantic baselines, but the manuscript text does not report exact metrics (e.g., recall@K, MRR), baseline implementation details, ablation results isolating the Fourier and offset terms, or statistical significance; without these, the magnitude and robustness of the temporal-consistency effect cannot be assessed.

minor comments (2)

[Section 2] Notation for reign keys and Fourier components could be introduced with a small table or diagram for readers unfamiliar with Classical Chinese calendrical conventions.
[Section 2] The related-work section should cite prior temporal IR and historical RAG papers (e.g., on event-based retrieval or calendar-aware embeddings) to better situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments highlight important aspects of benchmark validity and experimental reporting that we will address to improve clarity and rigor. Below we respond point-by-point to the major comments.

read point-by-point responses

Referee: [Section 3] Section 3 (Benchmark Construction): the month-level reign keys are stated to be derived from surrounding context, yet the manuscript provides no explicit check that this derivation uses only information that would be available to a downstream RAG query; if full-annals context is required for key assignment, the reported gains may not transfer to realistic query settings.

Authors: We agree that an explicit check is necessary to ensure transferability. In ChunQiuTR construction, reign keys are derived solely from the explicit reign phrases and the immediately surrounding sentences within each individual record (typically 1-3 sentences), which directly corresponds to the information present in a standard RAG query. No global annals-level context is used. We will add a dedicated verification subsection to Section 3 that formalizes this process, provides concrete examples of key assignment, and confirms locality to query-available context only. revision: yes
Referee: [Sections 4 and 5] Section 4 (CTD Model) and Section 5 (Experiments): the central claim that temporal consistency is a 'key prerequisite' rests on the chrono-near confounders accurately mirroring real retrieval failures, but the text supplies no independent validation (expert annotation, historical error analysis, or comparison against attested semantic-temporal mismatches) that the confounders avoid leaking explicit calendrical signals that the Fourier/offset components can exploit but genuine queries lack.

Authors: This concern about potential leakage is well-taken and directly affects the strength of our central claim. The chrono-near confounders were generated by first retrieving semantically similar records via embedding similarity and then selecting those with adjacent reign periods, without injecting any explicit calendrical metadata into the selection. We will expand Section 5 with a detailed construction protocol and rationale explaining why this process avoids leaking absolute temporal signals (e.g., no use of reign-year numbers or month offsets during confounder selection). While we did not conduct separate expert annotation or historical error analysis in the submitted version, the design follows documented patterns of temporal ambiguity in the Spring and Autumn exegetical tradition. We will also add a limitations paragraph noting the absence of external validation and its implications. revision: partial
Referee: [Section 5] Section 5 (Experiments): the abstract asserts 'consistent gains' over semantic baselines, but the manuscript text does not report exact metrics (e.g., recall@K, MRR), baseline implementation details, ablation results isolating the Fourier and offset terms, or statistical significance; without these, the magnitude and robustness of the temporal-consistency effect cannot be assessed.

Authors: We acknowledge that the main text should explicitly state the quantitative results rather than deferring entirely to tables. All requested details—exact recall@K and MRR values across settings, baseline implementations (standard dual-encoder with identical backbone and training), ablation results for the Fourier component and offset biasing in isolation, and statistical significance (paired t-tests with p-values)—are already computed and appear in Tables 2–4 plus the appendix. We will revise Section 5 to incorporate these numbers and implementation specifics directly into the narrative, along with a brief discussion of effect sizes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and comparative evaluation are independent of claimed results

full rationale

The paper introduces ChunQiuTR as a new time-keyed benchmark derived from the Spring and Autumn Annals and proposes the CTD dual-encoder with explicit Fourier calendrical and offset components. Claims rest on experimental gains versus semantic baselines under time-keyed metrics. No equations, parameter fits, or self-citations are shown to reduce the reported improvements or the 'temporal consistency prerequisite' conclusion to tautological definitions or inputs by construction. The derivation chain consists of independent dataset curation followed by standard retrieval evaluation, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard dual-encoder training assumptions and the manual construction of time keys and confounders from the Spring and Autumn Annals exegetical tradition; no free parameters or invented entities are introduced beyond conventional model components.

axioms (1)

domain assumption Semantic similarity plus explicit temporal features will improve retrieval when time consistency is required.
Invoked in the description of CTD and the time-keyed evaluation setup.

pith-pipeline@v0.9.0 · 5539 in / 1301 out tokens · 45508 ms · 2026-05-10T18:09:20.537426+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 4 canonical work pages · 2 internal anchors

[1]

Retrieval-Augmented Generation for Large Language Models: A Survey

XLTime: A cross-lingual knowledge transfer framework for temporal expression extraction. In Findings of the Association for Computational Lin- guistics: NAACL 2022, pages 1931–1942, Seattle, United States. Association for Computational Lin- guistics. Alejandro Sánchez de Castro, Lourdes Araujo, and Juan Martinez-Romo. 2025. A novel methodology for enhanci...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

2016 Conference on Language Technolo- gies and Digital Humanities

Temporal information retrieval via time- specifier model merging. InProceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM), pages 1–13, Vienna, Austria. As- sociation for Computational Linguistics. Rohan Jha, Bo Wang, Michael Günther, Georgios Mas- trapas, Saba Sturua, Isabelle Mohr, Andreas Kouk- ounas, Mohammad Kalim Wang, N...

2024
[3]

Towards General Text Embeddings with Multi-stage Contrastive Learning

Curran Associates, Inc. Xiaoyan Li and W. Bruce Croft. 2003. Time-based lan- guage models. InProceedings of the Twelfth Inter- national Conference on Information and Knowledge Management, CIKM ’03, page 469–475, New York, NY , USA. Association for Computing Machinery. Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. ...

work page internal anchor Pith review arXiv 2003
[4]

寡人有弟不能和协，使糊其口于四方

PhiloGPT: A philology-oriented large lan- guage model for Ancient Chinese manuscripts with dunhuang as case study. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 2784–2801, Miami, Florida, USA. Association for Computational Linguistics. Text (excerpt + short gloss) Source Label 夏，五月，郑伯克段于鄢。 Annals anchor:...

work page arXiv 2024
[5]

Yin (L=2) span gap ratio 0 50 100 150 200months
[6]

Huan (L=4) span gap ratio 0 100 200 300 400months
[7]

Zhuang (L=4) span gap ratio 0 5 10 15 20 25months
[8]

Min (L=2) span gap ratio 0 100 200 300 400months
[9]

Xi (L=3) span gap ratio 0 50 100 150 200months
[10]

Wen (L=4) span gap ratio 0 50 100 150 200months
[11]

Xuan (L=4) span gap ratio 0 50 100 150 200months
[12]

Cheng (L=4) span gap ratio 0 100 200 300 400months
[13]

Xiang (L=3) span gap ratio 0 100 200 300 400months
[14]

Zhao (L=4) span gap ratio 0 50 100 150 200months
[15]

Ding (L=4) span gap ratio 0 100 200 300months
[16]

不确定/需要证据

Ai (L=12) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio (%) 0 20 40 60 80 100 gap ratio...

work page arXiv 2009
[17]

证据选择：从 DOC 中选出“最关键”的 1~3 条。
[18]

最终回答：只基于你选的证据作答（中文，2~6 句）。
[19]

证据引用：在答案末尾列出你用到的 DOC 。
[20]

该月无事可书”（以资料措辞为准）。 Thinking: 好的，用户让我作为《春秋》史料问答助手，只依据给定的资料来回答问题。问题是查询鲁隐公元年十二月有什么重要事件。我得先看看资料里这个月份有什么记录。资料中有鲁隐公元年十一月的记录，但那里文本说“经传于是月史事阙如，无专条可录

若问题对应月份为空月：请明确写“该月无事可书”（以资料措辞为准）。 Thinking: 好的，用户让我作为《春秋》史料问答助手，只依据给定的资料来回答问题。问题是查询鲁隐公元年十二月有什么重要事件。我得先看看资料里这个月份有什么记录。资料中有鲁隐公元年十一月的记录，但那里文本说“经传于是月史事阙如，无专条可录”，所以十一月没有事件。接下来是鲁隐公元年十二月，这个月份有两个部分。第一部分是关于“冬，十有二月，祭伯来”的记载，这里有《春秋》经文和三部传的详细内容。第二部分是关于“公子益师卒”的记载，同样有经文和三传的说明。这些看起来都是十二月发生的事件。鲁隐公二年正月的记录是关于“公会戎于潜”的，这已经是第二年的事情了，跟问题中的隐公元年十二月没关系。现在我需要选择证据。问题问的是鲁隐公元年十二月...
[21]

versions

**证据选择**： - 选择鲁隐公元年十二月记录中的两个部分（两个“versions”）。因为该月份资料明确包含两条独立经文记载：“冬，十有二月，祭伯来。”和“公子益师卒。”，并附有三传的注解与讨论，它们直接对应问题所询问的该月重要事件。
[22]

冬，十有二月，祭伯来

**最终回答**：根据《春秋》记载，鲁隐公元年十二月发生了两件重要事件。第一是“冬，十有二月，祭伯来”，即祭伯来到鲁国，《左传》指出此举“非王命也”。第二是“公子益师卒”，即鲁国公子益师去世，《左传》解释其卒不记载具体日期是因为“公不与小敛”。这两条是本月明确记载于经文的事件。
[23]

冬，十有二月，祭伯来

**证据引用**： Evidence: 鲁隐公元年十二月 (version 1, 经文“冬，十有二月，祭伯来”及相关传注)；鲁隐公元年十二月 (version 2，经文“公子益师卒”及相关传注) Output (w/o Thinking):
[24]

证据选择：（魯隱公元年十二月）：这是问题直接对应的月份记录，包含两条经文及三传相关记载，是回答问题的核心依据。
[25]

最终回答：根据《春秋》记载，鲁隐公元年十二月发生了两件重要事件。第一，祭伯来到鲁国，但《左传》指出这是“非王命也”，即并非奉周天子之命。第二，鲁国公子益师去世，但《春秋》未记载其去世的具体日期。因此，该月的重要事件是祭伯来和公子益师卒。
[26]

冬，十有二月，祭伯來。祭伯者何？天子之大夫也。何以不稱使？奔也。…

证据引用： Evidence: "冬，十有二月，祭伯來。祭伯者何？天子之大夫也。何以不稱使？奔也。… " "公子益師卒。" Ground Truth: (1) 祭伯来；(2) 公子益师卒。 Ground Truth coverage 2/2 (correct) Grounded: cites ground truth DOCs Grounded: cites ground truth DOCs Ground truth DOCs Ground Truth coverage 2/2 (correct) Figure 11: Evidence-bounded RAG for the same query as Fig. 10. With a small evidence pack containing the...