High-Risk Memories? Comparative audit of the representation of Second World War atrocities in Ukraine by generative AI applications
Pith reviewed 2026-05-10 12:31 UTC · model grok-4.3
The pith
Generative AI applications risk misrepresenting Ukrainian WWII atrocities through factual errors and selective moralization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By auditing three common genAI applications on queries about Ukrainian WWII atrocities, the authors find evidence of hallucinations, inconsistent moralization, and other forms of misrepresentation that could affect how high-risk historical memories are preserved and understood.
What carries the argument
The comparative audit of three generative AI applications using prompts on Second World War atrocities in Ukraine, evaluating outputs for factual accuracy, group representation, and moral consistency.
If this is right
- AI content on historical atrocities may include fabricated details or events.
- Outputs could unfairly target or mischaracterize specific ethnic or national groups.
- Selective emphasis on certain aspects might promote one-sided views of history.
- Future memory practices may rely more on AI-generated narratives that require verification.
Where Pith is reading between the lines
- Similar risks likely apply to other contested historical events beyond Ukraine.
- Users of genAI for historical research should always verify outputs with primary sources.
- This points to a broader need for ethical guidelines in training AI on sensitive topics.
Load-bearing premise
That the responses from three common genAI applications to prompts about Ukrainian WWII atrocities reflect the general behavior of generative AI toward high-risk historical memories.
What would settle it
A systematic test showing that multiple genAI models consistently produce accurate, balanced, and non-hallucinated accounts of the Ukrainian WWII atrocities without selective moralization.
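Such a systematic test could be sketched, very roughly, as an audit harness that sends the same prompt to several models and checks each answer against a documented reference fact. This is purely illustrative, not the authors' protocol: `query_model` is a hypothetical stand-in for any genAI API, the canned answers and the `REFERENCE_FACTS` entry are invented for the demo, and the containment check is a crude proxy for the manual coding a real audit would require.

```python
# Illustrative audit harness (hypothetical, not the paper's method):
# query several models with the same prompt and flag answers that do not
# contain a documented reference fact as potential hallucinations.

REFERENCE_FACTS = {
    # Documented date of the Babyn Yar massacre (late September 1941).
    "babyn_yar_date": "September 1941",
}

def query_model(model_name, prompt):
    """Placeholder for a real genAI API call; returns canned answers for the demo."""
    canned = {
        "model_a": "The massacre at Babyn Yar took place in September 1941.",
        "model_b": "The massacre at Babyn Yar took place in March 1943.",
    }
    return canned[model_name]

def audit(models, prompt, fact_key):
    """Return, per model, whether its answer contains the reference fact."""
    expected = REFERENCE_FACTS[fact_key]
    results = {}
    for m in models:
        answer = query_model(m, prompt)
        # Crude substring check; a real audit would use human coders or
        # a validated coding scheme rather than string matching.
        results[m] = expected in answer
    return results

report = audit(["model_a", "model_b"],
               "When did the Babyn Yar massacre occur?",
               "babyn_yar_date")
print(report)  # {'model_a': True, 'model_b': False}
```

Extending this loop across many prompts, repeated runs, and multiple coders is what would turn a one-off spot check into the kind of systematic evidence described above.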
Original abstract
The rise of generative artificial intelligence (genAI) models poses new possibilities and risks for how the past is remembered by accelerating content production and altering the process of information discovery. The most critical risk is historical misrepresentation, which ranges from the distortion of facts and inaccurate depiction of specific groups to more subtle forms, such as the selective moralization of history. The dangers of misrepresentation of the past are particularly pronounced for high-risk memories, such as memories of past atrocities, which have a strong emotional load and are often instrumentalised by political actors. To understand how substantive this risk is, we empirically investigate how genAI applications deal with high-risk memories of the Second World War atrocities in Ukraine. This case is crucial due to the scope of the atrocities and the intense, often instrumentalised, contestation surrounding their memory. We audit the performance of three common genAI applications for different types of misrepresentation, including hallucinations and inconsistent moralization, and discuss the implications for future memory practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that generative AI applications pose risks of historical misrepresentation when handling high-risk memories, particularly those of Second World War atrocities in Ukraine. It empirically audits three common genAI applications for issues including hallucinations, fact distortion, inaccurate depictions of groups, and inconsistent or selective moralization, arguing that the Ukrainian case is crucial due to the scale of atrocities and ongoing political contestation, with implications for future memory practices.
Significance. If the audit were fully reported with transparent methods, prompts, outputs, and systematic evaluation, the work could usefully document concrete patterns of misrepresentation in current genAI tools on a politically sensitive historical topic. This would add to the literature on AI and collective memory by providing an empirical baseline for one high-stakes case. However, the absence of any methods, data, or results in the provided manuscript means the claimed findings cannot yet be assessed or generalized.
major comments (3)
- [Abstract] The text states that the authors 'empirically investigate' and 'audit the performance' of three genAI applications for hallucinations and inconsistent moralization, yet supplies no description of the applications tested, the prompts used, the sample size or selection of queries, the coding scheme for misrepresentation, or any observed outputs/results. Without these elements it is impossible to evaluate whether the audit supports the central claim that the risk is 'substantive'.
- [Introduction / Discussion] The manuscript generalizes from a single national-historical case (Ukrainian WWII atrocities) and three unspecified applications to statements about 'high-risk memories' in general. No sampling rationale, comparison to other architectures/training regimes, or discussion of why these three tools suffice for representativeness is provided, so the observed patterns (whatever they may be) cannot be treated as typical of the model class.
- [Discussion] No limitations section or explicit discussion of scope appears; the work therefore offers no basis for assessing external validity, prompt sensitivity, or temporal stability of any findings once models are updated.
minor comments (1)
- [Abstract] The abstract refers to 'selective moralization of history' without defining the term or distinguishing it from other forms of bias; a brief operational definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that greater methodological transparency, clearer scoping of claims, and an explicit limitations discussion are necessary. We have undertaken a major revision to address each point, expanding the methods, qualifying the generalizability of the case study, and adding a dedicated limitations section. Below we respond to the major comments point by point.
Point-by-point responses
Referee: [Abstract] The text states that the authors 'empirically investigate' and 'audit the performance' of three genAI applications for hallucinations and inconsistent moralization, yet supplies no description of the applications tested, the prompts used, the sample size or selection of queries, the coding scheme for misrepresentation, or any observed outputs/results. Without these elements it is impossible to evaluate whether the audit supports the central claim that the risk is 'substantive'.
Authors: We accept this criticism. The original submission did not include sufficient detail on the audit procedure. In the revised manuscript we have added a dedicated methods section that specifies the three applications, the complete prompt set, the sample size and selection criteria for queries (focused on documented WWII atrocities in Ukraine), the coding framework for detecting hallucinations, factual distortions, group misrepresentations, and selective moralization, and representative outputs with our annotations. These materials are also provided in full in a new appendix to enable independent evaluation. revision: yes
Referee: [Introduction / Discussion] The manuscript generalizes from a single national-historical case (Ukrainian WWII atrocities) and three unspecified applications to statements about 'high-risk memories' in general. No sampling rationale, comparison to other architectures/training regimes, or discussion of why these three tools suffice for representativeness is provided, so the observed patterns (whatever they may be) cannot be treated as typical of the model class.
Authors: We agree that the original text did not sufficiently delimit the scope of generalization. The revised introduction now explicitly frames the work as a focused case study of one high-risk memory domain, explains the rationale for selecting the Ukrainian WWII atrocities (scale of events and ongoing political contestation), and justifies the choice of the three consumer-facing applications as leading examples of current generative systems. The discussion has been updated to state that the observed patterns should not be assumed to hold for all architectures, training regimes, or other high-risk memories, and we call for comparative studies. revision: yes
Referee: [Discussion] No limitations section or explicit discussion of scope appears; the work therefore offers no basis for assessing external validity, prompt sensitivity, or temporal stability of any findings once models are updated.
Authors: We acknowledge the absence of an explicit limitations discussion. The revised manuscript now contains a new 'Limitations' section that addresses the study's scope, the sensitivity of outputs to prompt phrasing, the rapid evolution of models that may render findings time-bound, and the restricted external validity beyond the chosen case and tools. This section also notes that the audit provides a snapshot rather than a definitive characterization of all generative AI behavior. revision: yes
Circularity Check
No circularity: empirical audit with no derivations or self-referential reductions
full rationale
The paper conducts a direct empirical audit of three generative AI applications using prompts on Ukrainian WWII atrocities, evaluating outputs for hallucinations, factual distortions, and inconsistent moralization. No equations, parameters, or derivation chains exist that could reduce claims to inputs by construction. The methodology relies on explicit testing and qualitative analysis of AI responses rather than on fitted inputs renamed as predictions or on self-citations that bear the load of the central findings. The assumption that the three applications provide insight into high-risk memory handling is a standard methodological choice for a case study, not a circular self-definition, and the results remain grounded in observable outputs without importing uniqueness theorems or ansatzes from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-risk memories of past atrocities have a strong emotional load and are often instrumentalised by political actors