Relational Gaze Transitions During Encoding Predict Episodic Recall of Naturalistic Scenes
Pith reviewed 2026-06-26 14:47 UTC · model grok-4.3
The pith
Relational gaze transitions during first viewing of naturalistic scenes predict later free recall of objects and relations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By annotating naturalistic scenes with graphs that link objects according to real-world relations, the study measures how often gaze moves between connected nodes during encoding. Participants exhibit above-chance relational gaze both at initial viewing and during blank-screen retrieval. The frequency of these encoding-phase transitions correlates with subsequent free-recall performance for object identities and for the relations themselves, surviving controls for salience, fixation count, meaning, and image-level variance. Retrieval-phase relational gaze shows no such predictive relation, indicating that the organizational process tracked by gaze is most critical while the memory is being l
What carries the argument
Scene-graph annotations applied to eye-tracking data to quantify relational gaze transitions (gaze shifts between meaningfully connected objects).
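The measure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes fixations have already been mapped to object labels, treats scene-graph edges as undirected, collapses consecutive fixations on the same object, and scores the proportion of remaining transitions that connect two graph-linked objects. The function name and the toy scene are hypothetical.

```python
import math
from typing import Hashable, Iterable, Sequence, Tuple


def relational_transition_rate(
    fixated_objects: Sequence[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> float:
    """Proportion of object-to-object gaze transitions that follow a scene-graph edge."""
    # Assumption: relations are treated as undirected, so (a, b) == (b, a).
    connected = {frozenset(edge) for edge in edges}

    # Collapse consecutive refixations so only shifts between objects count.
    path = [
        obj
        for i, obj in enumerate(fixated_objects)
        if i == 0 or obj != fixated_objects[i - 1]
    ]
    transitions = list(zip(path, path[1:]))
    if not transitions:
        return math.nan  # no between-object transition to score

    hits = sum(frozenset(pair) in connected for pair in transitions)
    return hits / len(transitions)


# Hypothetical toy scene: a woman walking a dog on a leash, with a tree nearby.
toy_edges = [("woman", "leash"), ("leash", "dog"), ("dog", "woman")]
toy_fixations = ["woman", "woman", "dog", "leash", "tree", "dog"]
toy_rate = relational_transition_rate(toy_fixations, toy_edges)  # 2 of 4 transitions
```

In the toy sequence, woman to dog and dog to leash follow graph edges, while leash to tree and tree to dog do not, giving a rate of 0.5.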
If this is right
- Relational gaze during encoding contributes to binding object details into coherent episodic memories.
- The same gaze measure can be extracted from complex, real-world scenes rather than only simplified displays.
- Relational organization occurs during initial exposure rather than during later retrieval attempts.
- Gaze-based metrics may index successful memory formation independently of low-level visual features.
- The approach extends measurement of relational processing to naturalistic viewing conditions.
Where Pith is reading between the lines
- If relational gaze marks encoding success, training or guiding such transitions could improve memory in applied settings such as education or eyewitness testimony.
- The dissociation between encoding and retrieval phases suggests that interventions timed to initial exposure may be more effective than those applied at test.
- Future work could test whether disrupting relational gaze patterns during viewing selectively impairs relational memory while sparing item memory.
- The method may generalize to dynamic video scenes if scene graphs can be extended over time.
Load-bearing premise
The scene graphs accurately represent the object relations that participants actually process while viewing the scenes.
What would settle it
A replication in which relational gaze at encoding no longer predicts recall once scene graphs are replaced by random or purely spatial object pairings.
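One way to build the random-pairing comparison is sketched below. It is an assumption-laden illustration rather than the paper's procedure: it keeps the number of edges fixed, draws them uniformly from all unordered object pairs, and recomputes the relational transition rate for each random graph. All function names and parameters are hypothetical, and a purely spatial variant would replace the random draw with distance-based pairs.

```python
import random
from itertools import combinations
from typing import Hashable, Iterable, List, Sequence, Tuple


def relational_rate(
    path: Sequence[Hashable], edges: Iterable[Tuple[Hashable, Hashable]]
) -> float:
    """Share of between-object transitions that link graph-connected objects."""
    connected = {frozenset(edge) for edge in edges}
    # Skip refixations of the same object; only real shifts are transitions.
    transitions = [(a, b) for a, b in zip(path, path[1:]) if a != b]
    if not transitions:
        return float("nan")
    return sum(frozenset(t) in connected for t in transitions) / len(transitions)


def random_pairing_null(
    path: Sequence[Hashable],
    objects: Iterable[Hashable],
    n_edges: int,
    n_perm: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Relational rates under graphs whose edges are random object pairings."""
    rng = random.Random(seed)
    all_pairs = list(combinations(sorted(objects), 2))
    return [
        relational_rate(path, rng.sample(all_pairs, n_edges))
        for _ in range(n_perm)
    ]


# Hypothetical toy data: same scan path scored against 200 random graphs.
toy_path = ["woman", "dog", "leash", "tree", "dog"]
toy_objects = {"woman", "dog", "leash", "tree"}
null_rates = random_pairing_null(toy_path, toy_objects, n_edges=3, n_perm=200)
```

If the recall prediction held equally well for rates computed from such random graphs, the effect could not be attributed to the annotated relations.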
Original abstract
Remembering a visual scene requires organizing distinct details into a cohesive event. This study investigates whether relation-guided gaze transitions provide a behavioural marker of this cognitive organization during episodic encoding and retrieval. By applying scene graph annotations to eye-tracking data, we measured whether gaze moved between objects that were meaningfully related within complex scenes. This approach allowed us to quantify relational scanning within naturalistic environments, moving beyond prior methods that relied on simplified displays or isolated relation types. Participants showed above-chance relational gaze during both initial viewing and blank-screen retrieval, indicating that gaze actively tracks scene structure during first viewing and at recall. Additionally, relational scanning at encoding predicted subsequent free recall of both object and relational details, even after accounting for salience, fixation frequency, meaning, and image-level differences. In contrast, relational scanning at retrieval did not predict recall success, suggesting that relational gaze is most functional to memory during its formation. Together, these findings show that relational gaze can be measured in complex scenes and may serve as a marker of episodic encoding during natural visual exploration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that applying scene-graph annotations to eye-tracking data from participants viewing naturalistic scenes reveals above-chance relational gaze transitions (between meaningfully connected objects) during both encoding and blank-screen retrieval. Relational scanning at encoding predicts subsequent free recall of both object and relational details, even after statistical controls for salience, fixation frequency, meaning, and image-level differences; the same measure at retrieval does not predict recall success. The authors interpret relational gaze as a behavioral marker of episodic encoding in complex, real-world scenes.
Significance. If the central empirical result is robust, the work supplies a measurable, naturalistic index of relational organization during memory formation that goes beyond simplified displays or single relation types. The encoding-versus-retrieval dissociation and the reported controls are potentially informative for models linking visual exploration to episodic memory.
major comments (2)
- [Methods] Methods (scene-graph construction and application): the central claim that relational gaze indexes cognitive organization relevant to encoding rests on the untested assumption that static, annotator-derived scene-graph edges correspond to the relations participants actually process during viewing. No participant validation, salience-weighted edge analysis, or comparison against alternative relational annotations is described to rule out the possibility that the predictive effect is driven by low-level co-occurrence or annotator bias rather than memory-relevant structure.
- [Results] Results (control analyses): although the abstract states that the encoding prediction survives controls for salience, fixation frequency, meaning, and image-level differences, the manuscript does not report the precise operationalization of these covariates, the model specifications, or effect-size changes after each control. Without these details it is impossible to evaluate whether the relational-gaze term remains load-bearing once all plausible confounds are entered.
minor comments (2)
- [Abstract] The abstract and introduction should explicitly define 'relational scanning' (e.g., proportion of transitions between graph-connected objects versus total transitions) and state the chance baseline used for the above-chance claim.
- [Figures] Figure legends and methods should clarify how blank-screen retrieval trials were aligned with the original scene graphs for the relational-gaze measure.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We respond to each major comment below, indicating planned revisions where appropriate.
- Referee: [Methods] Methods (scene-graph construction and application): the central claim that relational gaze indexes cognitive organization relevant to encoding rests on the untested assumption that static, annotator-derived scene-graph edges correspond to the relations participants actually process during viewing. No participant validation, salience-weighted edge analysis, or comparison against alternative relational annotations is described to rule out the possibility that the predictive effect is driven by low-level co-occurrence or annotator bias rather than memory-relevant structure.
  Authors: We acknowledge that the scene-graph annotations are static and annotator-derived without direct participant validation of the specific relations processed during viewing. While these annotations follow standard protocols from the scene-understanding literature and the reported effects survive controls for low-level factors, we agree this leaves open the possibility of annotator bias or co-occurrence driving the results. In revision we will add a limitations paragraph explicitly discussing this assumption, report inter-annotator agreement statistics for the scene graphs, and outline potential future validation approaches. No new participant data collection is feasible at this stage.
  Revision: partial
- Referee: [Results] Results (control analyses): although the abstract states that the encoding prediction survives controls for salience, fixation frequency, meaning, and image-level differences, the manuscript does not report the precise operationalization of these covariates, the model specifications, or effect-size changes after each control. Without these details it is impossible to evaluate whether the relational-gaze term remains load-bearing once all plausible confounds are entered.
  Authors: The control analyses appear in the Results, but we agree that the precise operational definitions, full model specifications, and stepwise effect-size changes are not reported with sufficient detail. In the revised manuscript we will expand the Methods and Results sections to define each covariate explicitly (including how salience maps, meaning ratings, and image-level factors were quantified), provide the complete mixed-effects regression equations, and add supplementary tables showing coefficient estimates and effect sizes before versus after each successive control.
  Revision: yes
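For orientation, a model of the kind the referee asks to see might look like the sketch below. The manuscript's actual specification is not reported, so every term is an assumption: a logistic link for a binary recall outcome, the four covariates named in the abstract entered as fixed effects, and random intercepts for participant and image. The indices are i for participant, j for image, and k for scene-graph node.

```latex
\operatorname{logit}\Pr(\mathrm{Recalled}_{ijk} = 1)
  = \beta_0
  + \beta_1\,\mathrm{RelGaze}_{ij}
  + \beta_2\,\mathrm{Salience}_{jk}
  + \beta_3\,\mathrm{FixCount}_{ijk}
  + \beta_4\,\mathrm{Meaning}_{jk}
  + u_i + v_j,
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2)
```

Under this hypothetical form, the stepwise tables the authors promise would report how the estimate of \beta_1 changes as each covariate is added.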
Circularity Check
No significant circularity detected
The paper reports an empirical eye-tracking study that applies pre-existing scene graph annotations to measure relational gaze transitions in naturalistic scenes, then tests whether those transitions statistically predict subsequent free recall after controlling for salience, fixation frequency, meaning, and image-level factors. All load-bearing steps consist of data collection, annotation application, and regression analyses on observed participant behavior; none reduce by definition or self-citation to the target outcome. The derivation chain is therefore self-contained against external benchmarks (recall performance) and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: regression models can isolate the unique contribution of relational gaze after controlling for salience, fixation frequency, meaning, and image-level factors.