StructMem: Structured Memory for Long-Horizon Behavior in LLMs
Pith reviewed 2026-05-09 22:16 UTC · model grok-4.3
The pith
StructMem builds a hierarchical memory that temporally anchors dual perspectives on events, linking events across long conversations while using fewer tokens and API calls.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StructMem is a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections by temporally anchoring dual perspectives and performing periodic semantic consolidation. This yields improved temporal reasoning and multi-hop performance on LoCoMo while substantially reducing token usage, API calls, and runtime relative to prior memory systems.
What carries the argument
The structure-enriched hierarchical memory that temporally anchors dual perspectives on events and performs periodic semantic consolidation to preserve original bindings while forming cross-event links.
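The review describes the mechanism only at a high level, but its pieces (event-level records, dual temporal anchors, periodic consolidation) can be illustrated with a minimal data-structure sketch. All names, the dict-based summary format, and the fixed consolidation period are assumptions for illustration, not details from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEvent:
    """One event with dual temporal anchors (illustrative field names)."""
    fact: str
    event_time: str    # when the event happened
    mention_time: str  # when it was said in the conversation

@dataclass
class HierarchicalMemory:
    events: list = field(default_factory=list)     # raw event level
    summaries: list = field(default_factory=list)  # consolidated level
    period: int = 4  # consolidate every `period` events (assumed knob)

    def add(self, event: MemoryEvent) -> None:
        self.events.append(event)
        if len(self.events) % self.period == 0:
            self.consolidate()

    def consolidate(self) -> None:
        # Stand-in for the paper's LLM-based semantic consolidation:
        # merge the latest window into one summary record, keeping the
        # time span so cross-event links stay temporally anchored.
        window = self.events[-self.period:]
        self.summaries.append({
            "span": (window[0].event_time, window[-1].event_time),
            "facts": [e.fact for e in window],
        })
```

The point of the sketch is only that the consolidated level keeps the original event-time bindings alongside the merged facts, which is what the "preserves event-level bindings" claim requires.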
If this is right
- Agents can answer more questions that require ordering or chaining events separated by many turns.
- Memory operations use fewer tokens and trigger fewer API calls during both storage and retrieval.
- Overall runtime for long-horizon tasks decreases without sacrificing answer quality.
- The system avoids the fragility of explicit graph construction while retaining relational structure.
Where Pith is reading between the lines
- The same anchoring-plus-consolidation pattern could be tested on non-conversational sequences such as long documents or planning traces.
- Periodic consolidation might be tuned to different compression rates for memory-constrained devices.
- If the method generalizes, it could support coherent state over sessions far longer than current flat or graph approaches allow.
- Hybrid systems combining this memory with existing episodic-memory techniques become worth exploring.
Load-bearing premise
That temporally anchoring dual perspectives and periodically consolidating semantics will reliably create accurate cross-event connections without losing critical details or introducing new errors.
What would settle it
A controlled run on LoCoMo multi-hop questions where StructMem either misses a required cross-event link, hallucinates a false connection, or consumes more tokens than a simple flat-memory baseline on the same input length.
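The token side of that test can be mocked up cheaply. A minimal sketch, assuming a whitespace tokenizer as a stand-in for the model's tokenizer and an invented 90/10 split between summarized and recent turns (neither is the paper's actual setup):

```python
def token_count(text: str) -> int:
    # Whitespace split as a crude stand-in for the model's tokenizer.
    return len(text.split())

def flat_prompt(turns: list[str]) -> str:
    # Flat-memory baseline: replay every raw turn at answer time.
    return "\n".join(turns)

def struct_prompt(summaries: list[str], recent: list[str]) -> str:
    # Structured memory: consolidated summaries plus a recent window.
    return "\n".join(summaries + recent)

turns = [f"turn {i}: the speaker reports one more daily event" for i in range(100)]
summaries = ["Summary: turns 0-89 cover three months of daily events."]

flat_tokens = token_count(flat_prompt(turns))
struct_tokens = token_count(struct_prompt(summaries, turns[90:]))
# The falsifying outcome described above is struct_tokens >= flat_tokens
# on inputs of comparable length.
```

Measuring exactly this quantity against a flat baseline, per input length, is what the settling experiment calls for.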
Figures
read the original abstract
Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems. Code: https://github.com/zjunlp/LightMem
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes StructMem, a structure-enriched hierarchical memory framework for LLMs that temporally anchors dual perspectives and performs periodic semantic consolidation to capture cross-event relationships. It claims this yields improved temporal reasoning and multi-hop QA performance on the LoCoMo benchmark while reducing token usage, API calls, and runtime relative to prior flat and graph-based memory systems.
Significance. If the empirical claims hold under rigorous evaluation, StructMem would offer a practical advance for long-horizon agents by mitigating the efficiency-structure trade-off in memory systems, with potential applicability to extended conversational and reasoning tasks.
major comments (3)
- [Methods] The Methods section provides no quantitative fidelity metric (e.g., fact-recall or human-verified error rate) for the semantic consolidation step; without this, the central claim that consolidation reliably induces accurate cross-event connections while preserving details cannot be evaluated.
- [Experiments] The Experiments and Results sections omit the evaluation protocol, exact baselines, statistical significance tests, and any ablation isolating the consolidation component or measuring drift across cycles; these omissions leave the reported gains on LoCoMo and efficiency metrics unsupported.
- [Results] No section reports an ablation or controlled measurement of information loss or distortion introduced by the LLM-based consolidation, which directly bears on whether the claimed accuracy improvements and token/runtime savings are sustainable.
minor comments (2)
- [Abstract] The GitHub link in the abstract should be accompanied by explicit instructions for reproducing the LoCoMo experiments and memory configurations.
- [Methods] Notation for the dual-perspective anchoring and consolidation frequency is introduced without a clear diagram or pseudocode, reducing clarity for readers attempting to implement the framework.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for more rigorous evaluation of the semantic consolidation step. We will revise the manuscript to incorporate quantitative metrics, detailed protocols, and ablations as outlined below.
read point-by-point responses
-
Referee: [Methods] The Methods section provides no quantitative fidelity metric (e.g., fact-recall or human-verified error rate) for the semantic consolidation step; without this, the central claim that consolidation reliably induces accurate cross-event connections while preserving details cannot be evaluated.
Authors: We agree that the absence of a quantitative fidelity metric limits evaluation of the consolidation step. In the revised manuscript, we will add a fact-recall accuracy metric on a held-out event subset (comparing pre- and post-consolidation recall) and human-verified precision for cross-event connections. These additions will directly support the reliability claims. revision: yes
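One hedged way to realize the proposed fact-recall metric. The substring check is a crude stand-in for the LLM-based or human judgment the rebuttal implies, and the example memories are invented:

```python
def fact_recall(reference_facts: list[str], memory_text: str) -> float:
    """Fraction of held-out reference facts still recoverable from a
    memory snapshot (substring match as a proxy for a real judge)."""
    hits = sum(1 for f in reference_facts if f.lower() in memory_text.lower())
    return hits / len(reference_facts)

facts = ["semester in Galway", "benefit concert"]
raw = "Tim is off to a semester in Galway; he played a benefit concert."
consolidated = "Tim spent a semester in Galway and performed at a concert."

pre = fact_recall(facts, raw)            # recall before consolidation
post = fact_recall(facts, consolidated)  # "benefit concert" was lost
fidelity_drop = pre - post               # the quantity the referee asks for
```

Reporting `fidelity_drop` per consolidation cycle would directly address the reliability claim.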
-
Referee: [Experiments] The Experiments and Results sections omit the evaluation protocol, exact baselines, statistical significance tests, and any ablation isolating the consolidation component or measuring drift across cycles; these omissions leave the reported gains on LoCoMo and efficiency metrics unsupported.
Authors: We will expand the Experiments section with a complete evaluation protocol description, exact baseline implementations and hyperparameters, statistical significance tests (e.g., paired t-tests with p-values), and ablations that isolate the consolidation component. We will also include performance measurements across successive consolidation cycles to quantify any drift. revision: yes
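The proposed significance test is standard. A minimal sketch of the paired t statistic over per-question scores (the scores in the test are invented; in practice the p-value would come from the t distribution with n-1 degrees of freedom, e.g. via scipy.stats.ttest_rel):

```python
import math

def paired_t_statistic(scores_a: list[float], scores_b: list[float]) -> float:
    """t = mean(d) / sqrt(var(d)/n) over paired differences d, e.g.
    per-question QA scores for StructMem vs. a flat baseline."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Pairing by question matters here: LoCoMo questions vary widely in difficulty, and an unpaired test would waste that structure.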
-
Referee: [Results] No section reports an ablation or controlled measurement of information loss or distortion introduced by the LLM-based consolidation, which directly bears on whether the claimed accuracy improvements and token/runtime savings are sustainable.
Authors: We acknowledge this gap. The revised Results section will include a controlled ablation measuring information loss via semantic similarity scores between original and consolidated memories, plus QA performance comparisons before/after consolidation. This will demonstrate that efficiency gains do not come at the cost of unsustainable distortion. revision: yes
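A minimal version of the proposed similarity measurement. Bag-of-words cosine is only a cheap proxy for the embedding-based score the authors presumably intend:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between an original memory and its
    consolidated form: 1.0 means identical word distributions, 0.0 means
    no overlap. An embedding model would replace this in practice."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Tracking this score across successive consolidation cycles would expose cumulative distortion rather than a single lossy step.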
Circularity Check
No circularity: empirical claims rest on external benchmark evaluation
full rationale
The paper presents StructMem as an engineering framework that combines temporal anchoring of dual perspectives with periodic semantic consolidation to build a hierarchical memory graph. All performance claims (improved temporal reasoning and multi-hop QA on LoCoMo, plus reductions in tokens/API calls/runtime) are justified by direct experimental comparison against prior memory systems rather than by any derivation, equation, fitted parameter, or self-referential definition. No mathematical steps, uniqueness theorems, ansatzes, or renamings of known results appear; the method is described procedurally and evaluated externally. This is the standard non-circular case for an applied systems paper whose central results are falsifiable via the reported benchmark.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
LiCoMemory: Lightweight and Cognitive Agentic Memory for Efficient Long-Term Reasoning
arXiv preprint arXiv:2511.01448, 2025.
-
[2]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. arXiv preprint, 2023.
discussion (0)