Recognition: 2 theorem links
· Lean TheoremRevealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency
Pith reviewed 2026-05-14 20:32 UTC · model grok-4.3
The pith
Vision-language models over-rely on large central objects and under-rely on people compared to humans when describing scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By ablating objects from complex scenes and quantifying the resulting semantic shift in model outputs, the work shows that VLMs systematically overweight large, central, and high-saliency objects relative to humans while under-weighting people; a model's size bias is the main predictor of its semantic divergence from human scene descriptions.
What carries the argument
Counterfactual Semantic Saliency (CSS), which scores an object's importance by the magnitude of semantic change in a model's description after the object is removed from the image.
If this is right
- Removing large objects alters model descriptions more than human descriptions.
- Central objects exert stronger causal influence on model outputs than on human ones.
- People receive lower causal weight in model scene descriptions than in human ones.
- A model's measured size bias directly predicts the size of its semantic divergence from humans.
Where Pith is reading between the lines
- The method could be extended to test whether the same biases appear in other vision tasks such as visual question answering.
- Reducing size bias during training might narrow the overall human-model gap more effectively than targeting other features.
- The framework supplies a practical way to rank objects by causal importance for any black-box model without internal access.
- Future benchmarks could incorporate CSS scores as a standard alignment metric rather than relying solely on passive similarity measures.
Load-bearing premise
Ablating objects from images produces clean causal changes in semantic meaning without artifacts from the removal process or from the chosen similarity metric between descriptions.
What would settle it
Re-running the identical ablation and similarity measurement on the human description data and finding the same size and center biases as in the models would falsify the claimed human-model gap.
Figures
read the original abstract
Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Counterfactual Semantic Saliency (CSS), a black-box framework that quantifies object importance in natural scenes by measuring semantic shifts in VLM and human descriptions after high-fidelity causal ablation of objects. It evaluates prominent VLMs against a human psychophysics baseline of 16,289 valid responses across 307 complex scenes and 1,306 counterfactual variants, revealing that models over-rely (relative to humans) on large objects, central objects, and high-saliency objects while under-relying on people; size bias is identified as a primary driver of model-human semantic divergence.
Significance. If the central claims hold after addressing measurement concerns, the work offers a practical, model-agnostic tool for probing high-level scene comprehension in closed VLMs and supplies a large-scale human anchor that could guide bias mitigation. The empirical focus on observable differences rather than fitted parameters, combined with promised code and data release, supports reproducibility and follow-on studies.
major comments (3)
- [Methods (Counterfactual Generation)] Methods, counterfactual generation subsection: The claim that ablation produces clean causal semantic shifts (central to all bias measurements) requires explicit validation against artifacts such as lighting inconsistencies, texture seams, or context violations. Without human naturalness ratings or comparison to alternative removal methods, the reported size/center/saliency biases and their correlation with divergence risk being confounded by the generation process itself.
- [Results (Bias Analysis and Divergence Correlation)] Results, similarity metric and statistical controls: The paper must specify the exact similarity metric (e.g., embedding cosine vs. description overlap) used to quantify semantic shifts and demonstrate that it isolates semantic rather than low-level visual changes. Additionally, the size-bias correlation analysis needs controls for scene complexity, object category, and multiple comparisons to support the claim that size bias is the primary driver.
- [Human Psychophysics Baseline] Human baseline section: While the 16,289 responses provide a credible anchor, the manuscript should report inter-rater reliability, exclusion criteria details, and any scene-complexity balancing to ensure the model-human gaps are not artifacts of response variability or stimulus selection.
minor comments (2)
- [Figures] Figure 1 and 2 captions should explicitly state the number of scenes, models, and response counts per panel for immediate readability.
- [Methods] Notation for CSS score (if formalized) should be introduced with a clear equation early in the Methods to avoid ambiguity in later comparisons.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional validations and clarifications.
read point-by-point responses
-
Referee: Methods, counterfactual generation subsection: The claim that ablation produces clean causal semantic shifts (central to all bias measurements) requires explicit validation against artifacts such as lighting inconsistencies, texture seams, or context violations. Without human naturalness ratings or comparison to alternative removal methods, the reported size/center/saliency biases and their correlation with divergence risk being confounded by the generation process itself.
Authors: We agree that validating the quality of the counterfactual generations is crucial to ensure the measured semantic shifts are not confounded by artifacts. In the revised manuscript, we will add a section reporting human naturalness ratings collected from 50 participants on a subset of 100 counterfactual images, showing high naturalness scores (mean 4.2/5). Additionally, we will compare our ablation method (which uses advanced inpainting) to simple object masking and report that the semantic shift patterns remain consistent, supporting that the biases are not due to generation artifacts. revision: yes
-
Referee: Results, similarity metric and statistical controls: The paper must specify the exact similarity metric (e.g., embedding cosine vs. description overlap) used to quantify semantic shifts and demonstrate that it isolates semantic rather than low-level visual changes. Additionally, the size-bias correlation analysis needs controls for scene complexity, object category, and multiple comparisons to support the claim that size bias is the primary driver.
Authors: We have now explicitly stated in the Methods section that the similarity metric is the cosine similarity of embeddings from the all-MiniLM-L6-v2 Sentence Transformer model, which focuses on semantic content. To show it isolates semantic changes, we added an analysis correlating the metric with low-level features (e.g., SSIM, color histograms) and found negligible correlations (r < 0.1). For the size-bias analysis, we included linear regression controls for scene complexity (object count), dominant object category, and applied FDR correction for multiple comparisons across bias types. These controls confirm size bias as the strongest predictor of divergence (beta = 0.45, p < 0.001). revision: yes
-
Referee: Human baseline section: While the 16,289 responses provide a credible anchor, the manuscript should report inter-rater reliability, exclusion criteria details, and any scene-complexity balancing to ensure the model-human gaps are not artifacts of response variability or stimulus selection.
Authors: We have expanded the Human Psychophysics Baseline section to include these details. Inter-rater reliability was assessed using Fleiss' kappa across all scenes, yielding kappa = 0.68, indicating substantial agreement. Exclusion criteria involved removing responses completed in under 3 seconds or failing attention checks (e.g., inconsistent color naming), excluding approximately 12% of initial responses. Scene selection was balanced for complexity by ensuring a uniform distribution of object counts (ranging 5-20) and semantic categories across the 307 scenes. revision: yes
Circularity Check
No significant circularity; empirical measurement framework
full rationale
The paper introduces Counterfactual Semantic Saliency (CSS) as a black-box method that quantifies object importance via measured semantic shifts after ablation, then compares VLM outputs to a human psychophysics baseline of 16,289 responses. No equations, fitted parameters, or derivations are presented that reduce the reported size/center/saliency biases or divergence metrics to quantities defined by the paper's own inputs. Claims rest on direct empirical comparisons using external human data and model responses rather than self-definitional loops, self-citation load-bearing premises, or renamed known results. The framework is self-contained against the provided benchmarks with no load-bearing internal reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclearCSS(o) = 1 − (1/N²) Σ E(dj)·E(d′k) / (‖E(dj)‖‖E(d′k)‖) … semantic shift induced by causal ablation
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclearSize Bias … Center Bias … Low-level Saliency Bias … Person Bias … Spearman ρ correlations
Reference graph
Works this paper leans on
-
[1]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Anthropic
A. Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024
2024
-
[3]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6541–6549, 2017
2017
-
[5]
A. F. Biten, L. Gómez, and D. Karatzas. Let there be a clock on the beach: Reducing object hallucination in image captioning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1381–1390, 2022
2022
-
[6]
SAM 3: Segment Anything with Concepts
N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Castiello and C
U. Castiello and C. Umiltà. Size of the attentional focus and efficiency of processing.Acta psychologica, 73(3):195–209, 1990
1990
-
[8]
M. Cerf, E. P. Frady, and C. Koch. Faces and text attract gaze independent of the task: Experimental data and computer model.Journal of vision, 9(12):10–10, 2009
2009
-
[9]
Chefer, S
H. Chefer, S. Gur, and L. Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 397–406, 2021
2021
-
[10]
Chefer, S
H. Chefer, S. Gur, and L. Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 782–791, 2021
2021
-
[11]
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced rea- soning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
De Haas, A
B. De Haas, A. L. Iakovidis, D. S. Schwarzkopf, and K. R. Gegenfurtner. Individual differences in visual salience vary along semantic dimensions.Proceedings of the National Academy of Sciences, 116(24):11687–11692, 2019. 10
2019
-
[13]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
2009
- [14]
-
[15]
M. P. Eckstein, K. Koehler, L. E. Welbourne, and E. Akbas. Humans, but not deep neural networks, often miss giant targets in scenes.Current Biology, 27(18):2827–2832, 2017
2017
-
[16]
T. Fel, I. F. Rodriguez Rodriguez, D. Linsley, and T. Serre. Harmonizing the object recognition strategies of deep neural networks with humans.Advances in neural information processing systems, 35:9432–9446, 2022
2022
-
[17]
P. M. Fitts. The information capacity of the human motor system in controlling the amplitude of movement.Journal of experimental psychology, 47(6):381, 1954
1954
- [18]
-
[19]
Geirhos, J.-H
R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020
2020
-
[20]
Geirhos, K
R. Geirhos, K. Narayanappa, B. Mitzkus, T. Thieringer, M. Bethge, F. A. Wichmann, and W. Brendel. Partial success in closing the gap between human and machine vision.Advances in Neural Information Processing Systems, 34:23885–23899, 2021
2021
-
[21]
Geirhos, P
R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet- trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. InInternational conference on learning representations, 2018
2018
-
[22]
A. Gokce and M. Schrimpf. Scaling laws for task-optimized models of the primate visual ventral stream.arXiv preprint arXiv:2411.05712, 2024
-
[23]
Gordon and B
J. Gordon and B. Van Durme. Reporting bias and knowledge acquisition. InProceedings of the 2013 workshop on Automated knowledge base construction, pages 25–30, 2013
2013
-
[24]
Goyal, Z
Y . Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. InInternational Conference on Machine Learning, pages 2376–2384. PMLR, 2019
2019
-
[25]
Harel, C
J. Harel, C. Koch, and P. Perona. Graph-based visual saliency.Advances in neural information processing systems, 19, 2006
2006
-
[26]
T. R. Hayes and J. M. Henderson. Deep saliency models learn low-, mid-, and high-level features to predict scene attention.Scientific reports, 11(1):18434, 2021
2021
-
[27]
J. M. Henderson and T. R. Hayes. Meaning-based guidance of attention in scenes as revealed by meaning maps.Nature human behaviour, 1(10):743–747, 2017
2017
-
[28]
Ilyas, S
A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features.Advances in neural information processing systems, 32, 2019
2019
-
[29]
L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis.IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998
1998
-
[30]
Jain and B
S. Jain and B. C. Wallace. Attention is not explanation. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, 2019
2019
-
[31]
S. Jo, G. Jang, and H. Park. Gmar: gradient-driven multi-head attention rollout for vision transformer interpretability. In2025 IEEE International Conference on Image Processing (ICIP), pages 582–587. IEEE, 2025. 11
2025
-
[32]
X. Ju, X. Liu, X. Wang, Y . Bian, Y . Shan, and Q. Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. InEuropean Conference on Computer Vision, pages 150–168. Springer, 2024
2024
-
[33]
T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In 2009 IEEE 12th international conference on computer vision, pages 2106–2113. IEEE, 2009
2009
-
[34]
G. T. A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ram’e, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J.-B. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsit- sulin, R. I. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y . Gao, B. Mustafa, I...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Koehler, F
K. Koehler, F. Guo, S. Zhang, and M. P. Eckstein. What do saliency models predict?Journal of vision, 14(3):14–14, 2014
2014
-
[36]
Leem and H
S. Leem and H. Seo. Attention guided cam: visual explanations of vision transformer guided by self-attention. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 2956–2964, 2024
2024
- [37]
-
[38]
W. Li, Z. Lin, K. Zhou, L. Qi, Y . Wang, and J. Jia. Mat: Mask-aware transformer for large hole image inpainting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10758–10768, 2022
2022
-
[39]
Y . Li, Y . Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023
2023
-
[40]
T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014
2014
-
[41]
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions.Advances in neural information processing systems, 30, 2017. 12
2017
-
[42]
Misra, C
I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2939, 2016
2016
-
[43]
S. E. Palmer.Vision science: Photons to phenomenology. MIT press, 1999
1999
-
[44]
Peebles and S
W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
-
[45]
why should i trust you?
M. T. Ribeiro, S. Singh, and C. Guestrin. " why should i trust you?" explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016
2016
-
[46]
Rohrbach, L
A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, 2018
2018
-
[47]
Rombach, A
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
2022
- [48]
-
[49]
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. InProceedings of the IEEE international conference on computer vision, pages 618–626, 2017
2017
-
[50]
A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [51]
-
[52]
Suvorov, E
R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V . Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022
2022
-
[53]
M. R. Taesiri, G. Nguyen, S. Habchi, C.-P. Bezemer, and A. Nguyen. Imagenet-hard: The hardest images remaining from a study of the power of zoom and spatial biases in image classification.Advances in Neural Information Processing Systems, 36:35878–35953, 2023
2023
-
[54]
B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions.Journal of vision, 7(14):4–4, 2007
2007
-
[55]
B. W. Tatler, M. M. Hayhoe, M. F. Land, and D. H. Ballard. Eye guidance in natural vision: Reinterpreting salience.Journal of vision, 11(5):5–5, 2011
2011
-
[56]
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
S. Tong, Z. Liu, Y . Zhai, Y . Ma, Y . LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024
2024
-
[58]
Ullman, L
S. Ullman, L. Assif, E. Fetaya, and D. Harari. Atoms of recognition in human and computer vision.Proceedings of the National Academy of Sciences, 113(10):2744–2749, 2016. 13
2016
-
[59]
Wallace, M
B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024
2024
- [60]
- [61]
- [62]
-
[63]
J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao. Predicting human gaze beyond pixels.Journal of vision, 14(1):28–28, 2014
2014
-
[64]
M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, and J. Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022
- [65]
-
[66]
J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y . Duan, W. Su, J. Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[67]
man in black
J. Zhuang, Y . Zeng, W. Liu, C. Yuan, and K. Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. InEuropean Conference on Computer Vision, pages 195–211. Springer, 2024. 14 A Technical Details of Counterfactual Semantic Saliency This section shows the prompts and hyperparameters employed in the Counterfa...
2024
-
[68]
The description should explain what is happening in the scene
-
[69]
The first three images have been provided with descriptions as an example
The descriptions need to be concise. The first three images have been provided with descriptions as an example. Please carefully review the examples, as they will give you an idea of the kind of images you will see in the survey and the kind of descriptions we expect. You need to satisfy these requirements to participate:
-
[70]
This means that you were raised speaking English
You MUST be a Native English speaker. This means that you were raised speaking English
-
[71]
You MUST carefully look at the example shown and provide descriptions as suggested
-
[72]
You MUST thoroughly review each image and provide a meaningful and grammatically correct description
-
[73]
To standardize the verbosity of human responses, participants were required to review examples of acceptable textual descriptions before beginning the main experimental trials (Fig
Please ensure to open this link on a laptop or Desktop. To standardize the verbosity of human responses, participants were required to review examples of acceptable textual descriptions before beginning the main experimental trials (Fig. 8). Importantly, the visual stimuli utilized for these calibration examples were strictly disjoint from the main datase...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.