arxiv: 2606.21818 · v1 · pith:D4IWYZZZnew · submitted 2026-06-20 · 🧬 q-bio.NC

Dynamic Computerized Tumbling-E Testing for Temporal Reliability of Human Sequential Perceptual Decisions

Pith reviewed 2026-06-26 11:24 UTC · model grok-4.3

classification 🧬 q-bio.NC
keywords tumbling-E task · perceptual decision making · temporal reliability · visual acuity · reaction time · adaptive staircase · dataset · timeout rate

The pith

A dynamic tumbling-E task records timing and adaptation to build a temporal reliability benchmark for perceptual decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a computerized version of the tumbling-E visual task that records reaction times, timeouts, stimulus-size changes, and reliability signals on every trial instead of producing only a single acuity score. Data from 21 participants across 77 sessions yield 1154 trials showing fast responses centered near 1.5 seconds, a 6.6 percent timeout rate, and steady convergence toward smaller optotypes. The authors introduce the Temporal Hallucination Index as a composite of delay and timeout measures to expose temporal features that static accuracy hides. This creates a human-only dataset intended as a baseline for comparing sequential decision behavior with artificial agents. The approach converts a familiar eye-chart method into a temporally resolved record of how perceptual choices unfold.

Core claim

The paper claims that the dynamic computerized tumbling-E procedure, with its automatic staircase adjustment and trial-by-trial capture of latency and timeouts, converts the task into a temporally resolved human perceptual-decision benchmark whose primary output is a dataset containing 1154 valid trials, 1078 non-timeout responses, and an observable Temporal Hallucination Index derived from delay and timeout components.
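
The headline rates follow directly from the reported counts. The sketch below is a minimal illustration, assuming a hypothetical per-trial record with `rt_ms` and `timed_out` fields (the paper's export schema is not reproduced here) and assuming the delay rate is taken over non-timeout responses. The counts, including the 3 responses slower than 3,000 ms, are the ones quoted in the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    # Hypothetical schema; field names are illustrative, not the paper's export format.
    rt_ms: Optional[float]  # None when the trial timed out
    timed_out: bool


def summarize(trials, budget_ms=3000.0):
    """Return (timeout_rate, delay_rate) for a list of Trial records."""
    n = len(trials)
    answered = [t for t in trials if not t.timed_out]
    timeouts = n - len(answered)
    delayed = sum(1 for t in answered if t.rt_ms > budget_ms)
    timeout_rate = timeouts / n
    # Assumption: delay rate is computed over non-timeout responses only.
    delay_rate = delayed / len(answered)
    return timeout_rate, delay_rate


# Synthetic trial list rebuilt from the reported counts:
# 1,078 responses (3 of them slower than 3,000 ms) and 76 timeouts.
trials = (
    [Trial(1500.0, False)] * 1075
    + [Trial(3200.0, False)] * 3
    + [Trial(None, True)] * 76
)
timeout_rate, delay_rate = summarize(trials)
print(f"timeout rate = {timeout_rate:.3f}")  # 76 / 1154
print(f"delay rate   = {delay_rate:.4f}")    # 3 / 1078
```

The individual reaction times here are placeholders; only the counts come from the paper.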

What carries the argument

The adaptive staircase, which automatically reduces or increases stimulus size on each trial, together with trial-level logging of reaction time, timeout events, and delay rate, from which the Temporal Hallucination Index is computed.
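
The manuscript does not state the exact staircase rule, so the following is only a sketch of one common choice: a 1-up/1-down rule with a multiplicative step, in which a timeout is treated like an error. The step factor of 0.8, the function names, and the use of the reported 7 to 63 px range as clamping bounds are all assumptions made for illustration.

```python
def next_size(size_px, correct, step=0.8, min_px=7, max_px=63):
    """One step of a hypothetical 1-up/1-down staircase.

    A correct response shrinks the optotype; an error or timeout enlarges it.
    The step factor and the pixel bounds are assumptions for illustration.
    """
    new_size = size_px * step if correct else size_px / step
    return max(min_px, min(max_px, round(new_size)))


def run_staircase(start_px, outcomes):
    """Return the stimulus size shown on each trial for a sequence of outcomes."""
    sizes = [start_px]
    for correct in outcomes:
        sizes.append(next_size(sizes[-1], correct))
    return sizes


# True = correct response, False = error or timeout.
sizes = run_staircase(63, [True, True, True, False, True, True])
print(sizes)
```

A transformed rule (for example 2-down/1-up) would converge on a different point of the psychometric function, which is one reason the exact algorithm matters for interpreting the adaptation statistics.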

If this is right

  • Static accuracy scores can mask delays and unstable convergence that the new measures detect.
  • Human responses remain fast with low timeout rates and converge smoothly toward threshold.
  • The dataset supplies trial-level signals usable for direct comparison against artificial decision systems.
  • Staircase behavior is dominated by transitions to smaller stimuli and reaches near-20/20 optotype sizes, although the paper makes no clinical acuity diagnosis.
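
The transition statistic in the last point can be computed from per-session size sequences. This is a minimal sketch, assuming each session is available as an ordered list of stimulus sizes (a hypothetical layout); the sequences below are illustrative and are not data from the paper.

```python
from collections import Counter


def transition_shares(sessions):
    """Share of within-session transitions to a smaller, equal, or larger next stimulus.

    `sessions` is a list of per-session stimulus-size sequences (hypothetical layout).
    """
    counts = Counter()
    for sizes in sessions:
        # Pair each trial with the next one inside the same session only.
        for prev, nxt in zip(sizes, sizes[1:]):
            if nxt < prev:
                counts["smaller"] += 1
            elif nxt > prev:
                counts["larger"] += 1
            else:
                counts["same"] += 1
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("smaller", "same", "larger")}


# Illustrative sequences only, not data from the paper.
shares = transition_shares([[63, 50, 40, 32, 40, 32], [40, 32, 26, 26, 21]])
print(shares)
```

Restricting pairs to the same session avoids counting the jump from the end of one staircase to the start of the next as a transition.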

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recording setup could be applied to other sensory thresholds to test whether temporal reliability patterns generalize.
  • If AI agents run the identical task, differences in timeout rates or adaptation speed would become directly observable.
  • Requiring an explicit formula and validation data for the THI would allow external checks on whether it adds information beyond raw timing statistics.

Load-bearing premise

The THI built from delay and timeout data supplies a meaningful and comparable index of temporal reliability.
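
Because the manuscript gives no formula for the THI, any concrete computation is a guess. Purely as an illustration, the sketch below assumes an unweighted mean of the timeout rate and the delay rate, with the delay rate taken over non-timeout responses. With the counts from the abstract this candidate gives about 0.034, which agrees with the human index of 0.034 quoted in the text accompanying Figure 4, but that agreement does not show that this is the authors' definition.

```python
def observable_thi(n_trials, n_timeouts, n_delayed, w_timeout=0.5, w_delay=0.5):
    """Hypothetical composite of timeout and delay rates.

    The weights and the choice of denominators are assumptions; the manuscript
    does not define the THI formula.
    """
    n_responses = n_trials - n_timeouts
    timeout_rate = n_timeouts / n_trials
    delay_rate = n_delayed / n_responses
    return w_timeout * timeout_rate + w_delay * delay_rate


# Counts reported in the abstract: 1,154 trials, 76 timeouts, 3 responses over 3,000 ms.
thi = observable_thi(1154, 76, 3)
print(round(thi, 3))  # 0.034
```

Other weightings or denominators could reproduce the same rounded value, so an explicit definition from the authors is still needed.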

What would settle it

Demonstration that the recorded timing and adaptation patterns fail to separate stable from unstable perceptual sequences across repeated sessions would remove the dataset's claimed value as a reliability benchmark.

Figures

Figures reproduced from arXiv: 2606.21818 by Avneek Sandhu, Bin Hu.

Figure 1. Human-only data filtering workflow. The exported table was filtered by exact User ID, de-duplicated, and reduced to valid human trials with interpretable trial, response, stimulus-size, and arcminute fields.
Human response latency and timeout behavior. The human subset contained 1,078 non-timeout responses and 76 timeouts. The timeout rate was 6.6%. Non-timeout reaction times were centered near 1.5 seconds:…
Figure 2. Human reaction-time distribution. Non-timeout human responses clustered around approximately 1.5 seconds, with very few responses beyond the 3-second human time budget.
Automatic staircase convergence and adaptation curve. The adaptation curve showed progressive movement toward smaller optotypes. Median retained stimulus size was 20 px (range, 7 to 63), and median arcminute value was 11.52 (range, 3.40 to 3…
Figure 3. Computerized staircase convergence across human trials.
Figure 4. Automatic stimulus-size transitions. Most within-session transitions moved to a smaller next stimulus, indicating stable progression toward the perceptual threshold.
Human temporal-instability signal. Using the directly observable delay and timeout components, the human temporal-instability index was 0.034. Timeout events contributed most of the observable instability, while delayed non-timeout responses we…
Figure 5. Human temporal-instability components. Timeout rate contributed most of the observable temporal instability; delay events over the 3-second human time budget were rare.

Original abstract

OBJECTIVES: Visual acuity and tumbling-E tasks are often treated as static threshold measures, yet sequential perceptual decisions unfold over time. A computerized tumbling-E task preserves response latency, timeouts, and stimulus-size adaptation, creating a temporal reliability dataset rather than only a chart-line score. This matters for human-AI comparison because the Temporal Hallucination Index (THI) shows how static accuracy can obscure delays, drift, persistence, and unstable convergence. METHODS: We curated trial-level human data from a computerized dynamic tumbling-E task. On each trial, a single E optotype appeared in one of four orientations, participants selected the perceived direction or timed out, and stimulus size was automatically adjusted through an adaptive staircase. Primary outcomes were reaction time, timeout rate, delay rate above a 3-second budget, and observable THI based on delay and timeout components. RESULTS: The final dataset included 1,154 valid trials from 21 human identifiers across 77 sessions. There were 1,078 non-timeout responses and 76 timeouts, giving a 6.6% timeout rate. Non-timeout reaction times centered near 1.5 seconds (mean 1546 ms; median 1506 ms; IQR 1306-1713 ms), with only 3 responses exceeding 3,000 ms. Adaptation was dominated by smaller-next-stimulus transitions (89.2%). Mean arcminutes declined from 29.42 at trial 0 to 5.04 at trial 19, supporting convergence near a 20/20-level optotype without clinical acuity diagnosis. CONCLUSIONS: This dataset converts a tumbling-E visual task into a temporally resolved human perceptual-decision benchmark. Its novel contribution is automatic capture of staircase behavior, response timing, timeouts, and trial-level reliability signals. The human data show fast timing and smooth adaptation toward threshold, establishing a human-only baseline for future comparison with artificial agents.
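
The statement that 5.04 arcminutes is near a 20/20-level optotype can be checked with the standard optotype geometry, in which a 20/20 letter subtends 5 arcminutes and its stroke width is one fifth of its height. The sketch below assumes the reported arcminute values describe the full height of the E, which the abstract does not state explicitly; the function name is illustrative.

```python
import math


def acuity_from_arcmin(optotype_arcmin):
    """Convert whole-optotype angular size to (MAR, Snellen denominator, logMAR).

    Assumes the reported arcminute value is the full height of the E, whose
    stroke width is one fifth of that height.
    """
    mar = optotype_arcmin / 5.0        # stroke width in arcminutes
    snellen_denominator = 20.0 * mar   # the X in 20/X
    log_mar = math.log10(mar)
    return mar, snellen_denominator, log_mar


# Mean arcminute values reported for trial 0 and trial 19.
for arcmin in (29.42, 5.04):
    mar, denom, log_mar = acuity_from_arcmin(arcmin)
    print(f"{arcmin:5.2f} arcmin -> 20/{denom:.0f}, logMAR {log_mar:.2f}")
```

Under that assumption the staircase moves from roughly a 20/118 optotype at trial 0 to roughly a 20/20 optotype at trial 19. These are stimulus sizes, not acuity estimates for any participant.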

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript curates and describes a dataset of 1,154 valid trials from a dynamic computerized tumbling-E optotype task performed by 21 human participants across 77 sessions. It reports reaction times (mean 1546 ms), timeout rates (6.6%), stimulus-size adaptation via staircase, and introduces the Temporal Hallucination Index (THI) constructed from delay and timeout components as a temporally resolved reliability signal intended to serve as a human baseline for future human-AI perceptual-decision comparisons.

Significance. A well-documented, trial-level dataset with timing and adaptation metadata could provide a useful empirical reference for studies of sequential perceptual decisions. However, because the THI—the index positioned as the key novel temporal-reliability construct—is never defined, the manuscript’s central claim that the data reveal delays, drift, and unstable convergence hidden by static accuracy cannot be evaluated or reproduced.

major comments (3)
  1. [Abstract] Abstract and Results: The THI is repeatedly invoked as the observable that 'shows how static accuracy can obscure delays, drift, persistence, and unstable convergence' and as the basis for the human-AI benchmark, yet no formula, weighting of delay versus timeout components, or computational definition is supplied anywhere in the manuscript.
  2. [Results] Methods/Results: No validation of the THI is reported against established reliability metrics (test-retest, split-half, or signal-detection d'), nor is any correlation or comparison with conventional acuity scores provided; without this the claim that THI supplies a 'meaningful and comparable measure of temporal reliability' remains unsupported.
  3. [Methods] Methods: Participant recruitment, inclusion/exclusion criteria, session structure, exact staircase algorithm, and timeout threshold definition are not described, preventing independent assessment of the reported counts (1,154 valid trials, 76 timeouts) and adaptation statistics.
minor comments (1)
  1. [Abstract] The abstract states 'Mean arcminutes declined from 29.42 at trial 0 to 5.04 at trial 19' but does not specify whether these values are means across participants or sessions, or how trial numbering is aligned across variable-length staircases.
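
The alignment question in the minor comment can be made concrete. The sketch below averages arcminute values by trial index across sessions of different lengths and reports how many sessions contribute at each index. It assumes sessions are aligned at trial 0, which is one plausible reading and not something the abstract confirms; the values are illustrative.

```python
from collections import defaultdict


def mean_by_trial_index(sessions):
    """Mean arcminute value at each trial index, with the number of sessions contributing.

    `sessions` is a list of per-session arcminute sequences of possibly different
    lengths, aligned at trial 0 (an assumption about how the curve was built).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for values in sessions:
        for index, value in enumerate(values):
            sums[index] += value
            counts[index] += 1
    return {i: (sums[i] / counts[i], counts[i]) for i in sorted(sums)}


# Illustrative values only: the shorter session stops contributing after trial 1.
curve = mean_by_trial_index([[30.0, 20.0, 10.0], [28.0, 22.0]])
for index, (mean, n) in curve.items():
    print(index, mean, n)
```

Reporting the contributing count next to each mean shows whether late-trial values such as the one at trial 19 rest on all sessions or only on the longest ones.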

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We address each of the major comments below and plan to revise the manuscript to incorporate the necessary clarifications and additions.

  1. Referee: [Abstract] Abstract and Results: The THI is repeatedly invoked as the observable that 'shows how static accuracy can obscure delays, drift, persistence, and unstable convergence' and as the basis for the human-AI benchmark, yet no formula, weighting of delay versus timeout components, or computational definition is supplied anywhere in the manuscript.

    Authors: We agree that the manuscript does not provide an explicit formula for the THI. The THI is described as being based on delay and timeout components, but the specific computational definition, including any weighting, was inadvertently omitted. We will include a clear mathematical definition of the THI in the revised Methods section. revision: yes

  2. Referee: [Results] Methods/Results: No validation of the THI is reported against established reliability metrics (test-retest, split-half, or signal-detection d'), nor is any correlation or comparison with conventional acuity scores provided; without this the claim that THI supplies a 'meaningful and comparable measure of temporal reliability' remains unsupported.

    Authors: The primary aim of this manuscript is to curate and describe the dataset as a baseline for future human-AI comparisons, rather than to fully validate the THI metric here. However, we recognize that demonstrating its relation to established metrics would enhance the paper. Since the dataset includes multiple sessions per participant, we will add test-retest reliability analyses and correlations with acuity measures in the revised version where feasible. revision: partial

  3. Referee: [Methods] Methods: Participant recruitment, inclusion/exclusion criteria, session structure, exact staircase algorithm, and timeout threshold definition are not described, preventing independent assessment of the reported counts (1,154 valid trials, 76 timeouts) and adaptation statistics.

    Authors: We acknowledge that these methodological details are missing from the current manuscript. We will expand the Methods section to fully describe participant recruitment and inclusion/exclusion criteria, session structure, the exact staircase algorithm used for stimulus-size adaptation, and the definition of the timeout threshold. This will enable readers to independently verify the reported trial counts and statistics. revision: yes
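
The test-retest analysis promised in the second response could take several forms. As one simple possibility, the sketch below computes a Pearson correlation between a per-participant summary measured in two sessions. The choice of median reaction time as the summary, the pairing of sessions, and the numbers themselves are assumptions for illustration and are not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative per-participant median reaction times (seconds) in two sessions.
session_1 = [1.42, 1.55, 1.48, 1.71, 1.39]
session_2 = [1.45, 1.50, 1.52, 1.66, 1.41]
r = pearson_r(session_1, session_2)
print(f"test-retest r = {r:.2f}")
```

With 21 participants spread over 77 sessions, an intraclass correlation that uses every session may be a better fit than a two-session Pearson coefficient.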

Circularity Check

0 steps flagged

No circularity; purely empirical dataset report with no derivation or fitted predictions

The manuscript presents summary statistics from 1,154 human trials in a tumbling-E task, including reaction times, timeout rates, and adaptation behavior. No equations, parameter fitting, predictions, or first-principles derivations are described that could reduce to inputs by construction. The THI is mentioned as an observable based on delay and timeout components but is not used in any load-bearing derivation or self-referential definition within the provided text. All reported outcomes are direct empirical aggregates, making the work self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The work is purely empirical data collection. No mathematical derivations, free parameters, or new physical entities are introduced in the abstract. The THI is referenced but not formalized.

invented entities (1)
  • Temporal Hallucination Index (THI) no independent evidence
    purpose: To quantify temporal reliability via delay and timeout components for human-AI comparison
    Mentioned in the abstract as an observable index but without explicit definition or independent validation data.

pith-pipeline@v0.9.1-grok · 5887 in / 1224 out tokens · 22410 ms · 2026-06-26T11:24:32.547349+00:00 · methodology

