Relational Gaze Transitions During Encoding Predict Episodic Recall of Naturalistic Scenes

Alex Kafkas; Hugo Rydel

arxiv: 2606.20844 · v1 · pith:KTIT2IKDnew · submitted 2026-06-18 · 🧬 q-bio.NC

Pith reviewed 2026-06-26 14:47 UTC · model grok-4.3

keywords: eye tracking, episodic memory, scene perception, relational processing, gaze transitions, naturalistic scenes, encoding, free recall

The pith

Relational gaze transitions during first viewing of naturalistic scenes predict later free recall of objects and relations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether eye movements that shift between meaningfully related objects in complex scenes serve as a behavioral marker of how the brain organizes details into memorable events. It applies scene-graph labels to eye-tracking recordings to quantify these relational transitions both while participants first view the scenes and later when they retrieve them from memory on a blank screen. Relational scanning during initial encoding reliably forecasts success in recalling both individual objects and the links between them, even after statistical controls for low-level visual salience, number of fixations, semantic content, and overall image properties. In contrast, the same relational gaze measure during retrieval does not forecast recall accuracy. The findings position relational gaze as functionally important for memory formation rather than for retrieval itself.

Core claim

By annotating naturalistic scenes with graphs that link objects according to real-world relations, the study measures how often gaze moves between connected nodes during encoding. Participants exhibit above-chance relational gaze both at initial viewing and during blank-screen retrieval. The frequency of these encoding-phase transitions correlates with subsequent free-recall performance for object identities and for the relations themselves, surviving controls for salience, fixation count, meaning, and image-level variance. Retrieval-phase relational gaze shows no such predictive relation, indicating that the organizational process tracked by gaze is most critical while the memory is being l

What carries the argument

Scene-graph annotations applied to eye-tracking data to quantify relational gaze transitions (gaze shifts between meaningfully connected objects).
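To make the measure concrete, the sketch below counts the share of between-object gaze transitions that follow a scene-graph edge. It is a minimal illustration, not the authors' pipeline: the function names, the undirected treatment of edges, the exclusion of refixations, and the uniform-transition chance level are all assumptions, and the objects and edges are invented.

```python
import math
from typing import Hashable, Iterable, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def relational_transition_rate(
    fixated_objects: Sequence[Hashable], edges: Iterable[Edge]
) -> float:
    """Share of between-object gaze transitions that follow a scene-graph edge."""
    # Assumption: relations are undirected, so gaze may travel an edge either way.
    linked = {frozenset(edge) for edge in edges}
    # Assumption: a transition is a change of fixated object; refixations are skipped.
    transitions = [
        (a, b) for a, b in zip(fixated_objects, fixated_objects[1:]) if a != b
    ]
    if not transitions:
        return float("nan")
    relational = sum(frozenset(pair) in linked for pair in transitions)
    return relational / len(transitions)


def uniform_chance_rate(n_objects: int, n_edges: int) -> float:
    """Expected rate if every object pair were an equally likely transition (assumed null)."""
    if n_objects < 2:
        return float("nan")
    return n_edges / math.comb(n_objects, 2)


# Hypothetical scene with five objects and three annotated relations.
edges = [("woman", "cup"), ("cup", "table"), ("dog", "sofa")]
scanpath = ["woman", "cup", "cup", "table", "dog", "woman"]

rate = relational_transition_rate(scanpath, edges)  # 2 of 4 transitions follow an edge
chance = uniform_chance_rate(n_objects=5, n_edges=len(edges))  # 3 of 10 possible pairs
```

Under these assumptions the toy scanpath has a rate of 0.5 against a uniform chance level of 0.3. The paper's actual chance baseline may be defined differently, for example by permutation.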

If this is right

Relational gaze during encoding contributes to binding object details into coherent episodic memories.
The same gaze measure can be extracted from complex, real-world scenes rather than only simplified displays.
Relational organization occurs during initial exposure rather than during later retrieval attempts.
Gaze-based metrics may index successful memory formation independently of low-level visual features.
The approach extends measurement of relational processing to naturalistic viewing conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If relational gaze marks encoding success, training or guiding such transitions could improve memory in applied settings such as education or eyewitness testimony.
The dissociation between encoding and retrieval phases suggests that interventions timed to initial exposure may be more effective than those applied at test.
Future work could test whether disrupting relational gaze patterns during viewing selectively impairs relational memory while sparing item memory.
The method may generalize to dynamic video scenes if scene graphs can be extended over time.

Load-bearing premise

The scene graphs accurately represent the object relations that participants actually process while viewing the scenes.

What would settle it

A replication in which relational gaze at encoding no longer predicts recall once scene graphs are replaced by random or purely spatial object pairings.
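One way such control graphs might be built is sketched below. It is illustrative only: the helper names, the centroid coordinates, and the choice to match the number of edges in the annotated graph are assumptions, not details from the paper.

```python
import math
import random
from itertools import combinations


def random_pairings(objects, n_edges, rng):
    """Draw n_edges distinct undirected object pairs uniformly at random."""
    pairs = list(combinations(sorted(objects), 2))
    return rng.sample(pairs, n_edges)


def spatial_pairings(centroids, n_edges):
    """Link the n_edges object pairs whose centroids lie closest together."""
    pairs = combinations(sorted(centroids), 2)
    by_distance = sorted(
        pairs, key=lambda p: math.dist(centroids[p[0]], centroids[p[1]])
    )
    return by_distance[:n_edges]


# Hypothetical object centroids in pixel coordinates.
centroids = {
    "woman": (120, 200),
    "cup": (150, 230),
    "table": (160, 300),
    "dog": (500, 320),
    "sofa": (560, 280),
}
annotated = [("woman", "cup"), ("cup", "table"), ("dog", "sofa")]

# Control graphs matched to the annotated graph in edge count.
random_graph = random_pairings(centroids, len(annotated), random.Random(0))
spatial_graph = spatial_pairings(centroids, len(annotated))
```

Recomputing the gaze measure on graphs like these and refitting the recall model would show whether the encoding effect is specific to the annotated relations. In this toy layout the spatial graph coincides with the annotated one, which illustrates why proximity is a confound worth separating from relational structure.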

Original abstract

Remembering a visual scene requires organizing distinct details into a cohesive event. This study investigates whether relation-guided gaze transitions provide a behavioural marker of this cognitive organization during episodic encoding and retrieval. By applying scene graph annotations to eye-tracking data, we measured whether gaze moved between objects that were meaningfully related within complex scenes. This approach allowed us to quantify relational scanning within naturalistic environments, moving beyond prior methods that relied on simplified displays or isolated relation types. Participants showed above-chance relational gaze during both initial viewing and blank-screen retrieval, indicating that gaze actively tracks scene structure during first viewing and at recall. Additionally, relational scanning at encoding predicted subsequent free recall of both object and relational details, even after accounting for salience, fixation frequency, meaning, and image-level differences. In contrast, relational scanning at retrieval did not predict recall success, suggesting that relational gaze is most functional to memory during its formation. Together, these findings show that relational gaze can be measured in complex scenes and may serve as a marker of episodic encoding during natural visual exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Relational gaze at encoding predicts recall via scene graphs in natural scenes, but the graphs' match to actual participant processing remains untested.

The core result is that relational gaze transitions—defined as eye movements between objects linked in scene-graph annotations—during initial scene viewing predict later free recall of both objects and relations, while the same measure during retrieval does not. This holds after controls for salience, fixation count, semantic meaning, and image-level variance.

The paper does a clean job extending prior eye-tracking work from simplified arrays to naturalistic scenes. Applying existing scene-graph annotations lets them quantify relational scanning without new stimuli, and the encoding-retrieval dissociation is a useful addition. Participants show above-chance relational gaze in both phases, which is consistent with the idea that gaze tracks scene structure.

The main soft spot is the assumption that annotator-derived graph edges reflect relations participants actually encode. The controls rule out some low-level confounds, but they do not test whether viewers process the specific edges used in the analysis. If many graph relations are not salient during viewing, the memory correlation could partly reflect object co-occurrence rather than relational organization. Methods details on how graphs were built and how multiple edges per image were handled would clarify this.

The work is aimed at researchers studying visual episodic memory and eye movements as encoding markers. Anyone already using scene graphs or interested in relational processing would find the approach and the encoding-specific result worth reading. The central claim is internally consistent and the question is well-posed, so the paper merits a serious referee even if revisions are needed on the graph-validation point.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that applying scene-graph annotations to eye-tracking data from participants viewing naturalistic scenes reveals above-chance relational gaze transitions (between meaningfully connected objects) during both encoding and blank-screen retrieval. Relational scanning at encoding predicts subsequent free recall of both object and relational details, even after statistical controls for salience, fixation frequency, meaning, and image-level differences; the same measure at retrieval does not predict recall success. The authors interpret relational gaze as a behavioral marker of episodic encoding in complex, real-world scenes.

Significance. If the central empirical result is robust, the work supplies a measurable, naturalistic index of relational organization during memory formation that goes beyond simplified displays or single relation types. The encoding-versus-retrieval dissociation and the reported controls are potentially informative for models linking visual exploration to episodic memory.

major comments (2)

[Methods] Methods (scene-graph construction and application): the central claim that relational gaze indexes cognitive organization relevant to encoding rests on the untested assumption that static, annotator-derived scene-graph edges correspond to the relations participants actually process during viewing. No participant validation, salience-weighted edge analysis, or comparison against alternative relational annotations is described to rule out the possibility that the predictive effect is driven by low-level co-occurrence or annotator bias rather than memory-relevant structure.
[Results] Results (control analyses): although the abstract states that the encoding prediction survives controls for salience, fixation frequency, meaning, and image-level differences, the manuscript does not report the precise operationalization of these covariates, the model specifications, or effect-size changes after each control. Without these details it is impossible to evaluate whether the relational-gaze term remains load-bearing once all plausible confounds are entered.

minor comments (2)

[Abstract] The abstract and introduction should explicitly define 'relational scanning' (e.g., proportion of transitions between graph-connected objects versus total transitions) and state the chance baseline used for the above-chance claim.
[Figures] Figure legends and methods should clarify how blank-screen retrieval trials were aligned with the original scene graphs for the relational-gaze measure.
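The alignment question raised in the last comment can be illustrated with a hypothetical scheme in which blank-screen fixations are scored against the object bounding boxes of the scene as it was originally displayed. The sketch assumes unchanged screen coordinates between phases and a smallest-enclosing-box rule for nested objects. Neither is confirmed by the paper, and the boxes and fixations are invented.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in screen pixels


def label_fixation(x: float, y: float, boxes: Dict[str, Box]) -> Optional[str]:
    """Return the object whose box contains the fixation, or None if no box does."""
    hits = []
    for name, (x0, y0, x1, y1) in boxes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            hits.append(((x1 - x0) * (y1 - y0), name))
    # Assumption: when boxes nest (a cup on a table), the smallest box wins.
    return min(hits)[1] if hits else None


# Hypothetical boxes taken from the scene as shown at encoding.
encoded_boxes = {"table": (100, 250, 400, 400), "cup": (150, 260, 190, 300)}

# Hypothetical fixations recorded on the blank screen during retrieval.
retrieval_fixations = [(170, 280), (300, 350), (600, 50)]
labels = [label_fixation(x, y, encoded_boxes) for x, y in retrieval_fixations]
```

Fixations that fall outside every box return None and would need an explicit rule (drop, or assign to the nearest object) before transitions are counted.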

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [Methods] Methods (scene-graph construction and application): the central claim that relational gaze indexes cognitive organization relevant to encoding rests on the untested assumption that static, annotator-derived scene-graph edges correspond to the relations participants actually process during viewing. No participant validation, salience-weighted edge analysis, or comparison against alternative relational annotations is described to rule out the possibility that the predictive effect is driven by low-level co-occurrence or annotator bias rather than memory-relevant structure.

Authors: We acknowledge that the scene-graph annotations are static and annotator-derived without direct participant validation of the specific relations processed during viewing. While these annotations follow standard protocols from the scene-understanding literature and the reported effects survive controls for low-level factors, we agree this leaves open the possibility of annotator bias or co-occurrence driving the results. In revision we will add a limitations paragraph explicitly discussing this assumption, report inter-annotator agreement statistics for the scene graphs, and outline potential future validation approaches. No new participant data collection is feasible at this stage.
revision: partial
Referee: [Results] Results (control analyses): although the abstract states that the encoding prediction survives controls for salience, fixation frequency, meaning, and image-level differences, the manuscript does not report the precise operationalization of these covariates, the model specifications, or effect-size changes after each control. Without these details it is impossible to evaluate whether the relational-gaze term remains load-bearing once all plausible confounds are entered.

Authors: The control analyses appear in the Results, but we agree that the precise operational definitions, full model specifications, and stepwise effect-size changes are not reported with sufficient detail. In the revised manuscript we will expand the Methods and Results sections to define each covariate explicitly (including how salience maps, meaning ratings, and image-level factors were quantified), provide the complete mixed-effects regression equations, and add supplementary tables showing coefficient estimates and effect sizes before versus after each successive control.
revision: yes
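For orientation, a mixed-effects specification of the kind the second response refers to might look like the following. This is a hypothetical form, not the model reported in the paper: the logistic link, the symbols, and the random-intercept structure are assumptions. Here $i$ indexes participants and $j$ images, $y_{ij}$ is recall of a given detail, $R_{ij}$ is relational gaze at encoding, and $S_{ij}$, $F_{ij}$, $M_{ij}$ stand for the salience, fixation-frequency, and meaning covariates.

```latex
\operatorname{logit} \Pr(y_{ij} = 1)
  = \beta_0 + \beta_R R_{ij} + \beta_S S_{ij} + \beta_F F_{ij} + \beta_M M_{ij}
  + u_i + v_j,
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
v_j \sim \mathcal{N}(0, \sigma_v^2)
```

In this form the image-level differences are absorbed by the random intercept $v_j$, and the stepwise tables the authors promise would track how $\beta_R$ changes as each covariate is entered.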

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports an empirical eye-tracking study that applies pre-existing scene graph annotations to measure relational gaze transitions in naturalistic scenes, then tests whether those transitions statistically predict subsequent free recall after controlling for salience, fixation frequency, meaning, and image-level factors. All load-bearing steps consist of data collection, annotation application, and regression analyses on observed participant behavior; none reduce by definition or self-citation to the target outcome. The derivation chain is therefore self-contained against external benchmarks (recall performance) and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract mentions no free parameters, invented entities, or non-standard axioms. Standard statistical assumptions for regression-based prediction are implicit but not detailed.

axioms (1)

domain assumption: Regression models can isolate the unique contribution of relational gaze after controlling for salience, fixation frequency, meaning, and image-level factors.
Invoked when stating that the prediction holds after accounting for those variables.

pith-pipeline@v0.9.1-grok · 5708 in / 1131 out tokens · 23240 ms · 2026-06-26T14:47:59.001500+00:00 · methodology

