High-Risk Memories? Comparative audit of the representation of Second World War atrocities in Ukraine by generative AI applications
Pith reviewed 2026-05-10 12:31 UTC · model grok-4.3
The pith
Generative AI applications risk misrepresenting Ukrainian WWII atrocities through factual errors and selective moralization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By auditing three common genAI applications on queries about Ukrainian WWII atrocities, the authors find evidence of hallucinations, inconsistent moralization, and other forms of misrepresentation that could affect how high-risk historical memories are preserved and understood.
What carries the argument
The comparative audit of three generative AI applications using prompts on Second World War atrocities in Ukraine, evaluating outputs for factual accuracy, group representation, and moral consistency.
If this is right
- AI content on historical atrocities may include fabricated details or events.
- Outputs could unfairly target or mischaracterize specific ethnic or national groups.
- Selective emphasis on certain aspects might promote one-sided views of history.
- Future memory practices may rely more on AI-generated narratives that require verification.
Where Pith is reading between the lines
- Similar risks likely apply to other contested historical events beyond Ukraine.
- Users of genAI for historical research should always verify outputs with primary sources.
- This points to a broader need for ethical guidelines in training AI on sensitive topics.
Load-bearing premise
That the responses from three common genAI applications to prompts about Ukrainian WWII atrocities reflect the general behavior of generative AI toward high-risk historical memories.
What would settle it
A systematic test showing that multiple genAI models consistently produce accurate, balanced, and non-hallucinated accounts of the Ukrainian WWII atrocities without selective moralization.
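Such a systematic test could be sketched, very roughly, as an audit harness that sends the same prompt to several models and checks each answer against a documented reference fact. This is purely illustrative, not the authors' protocol: `query_model` is a hypothetical stand-in for any genAI API, the canned answers and the `REFERENCE_FACTS` entry are invented for the demo, and the containment check is a crude proxy for the manual coding a real audit would require.

```python
# Illustrative audit harness (hypothetical, not the paper's method):
# query several models with the same prompt and flag answers that do not
# contain a documented reference fact as potential hallucinations.

REFERENCE_FACTS = {
    # Documented date of the Babyn Yar massacre (late September 1941).
    "babyn_yar_date": "September 1941",
}

def query_model(model_name, prompt):
    """Placeholder for a real genAI API call; returns canned answers for the demo."""
    canned = {
        "model_a": "The massacre at Babyn Yar took place in September 1941.",
        "model_b": "The massacre at Babyn Yar took place in March 1943.",
    }
    return canned[model_name]

def audit(models, prompt, fact_key):
    """Return, per model, whether its answer contains the reference fact."""
    expected = REFERENCE_FACTS[fact_key]
    results = {}
    for m in models:
        answer = query_model(m, prompt)
        # Crude substring check; a real audit would use human coders or
        # a validated coding scheme rather than string matching.
        results[m] = expected in answer
    return results

report = audit(["model_a", "model_b"],
               "When did the Babyn Yar massacre occur?",
               "babyn_yar_date")
print(report)  # {'model_a': True, 'model_b': False}
```

Extending this loop across many prompts, repeated runs, and multiple coders is what would turn a one-off spot check into the kind of systematic evidence described above.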
Original abstract
The rise of generative artificial intelligence (genAI) models poses new possibilities and risks for how the past is remembered by accelerating content production and altering the process of information discovery. The most critical risk is historical misrepresentation, which ranges from the distortion of facts and inaccurate depiction of specific groups to more subtle forms, such as the selective moralization of history. The dangers of misrepresentation of the past are particularly pronounced for high-risk memories, such as memories of past atrocities, which have a strong emotional load and are often instrumentalised by political actors. To understand how substantive this risk is, we empirically investigate how genAI applications deal with high-risk memories of the Second World War atrocities in Ukraine. This case is crucial due to the scope of the atrocities and the intense, often instrumentalised, contestation surrounding their memory. We audit the performance of three common genAI applications for different types of misrepresentation, including hallucinations and inconsistent moralization, and discuss the implications for future memory practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that generative AI applications pose risks of historical misrepresentation when handling high-risk memories, particularly those of Second World War atrocities in Ukraine. It empirically audits three common genAI applications for issues including hallucinations, fact distortion, inaccurate depictions of groups, and inconsistent or selective moralization, arguing that the Ukrainian case is crucial due to the scale of atrocities and ongoing political contestation, with implications for future memory practices.
Significance. If the audit were fully reported with transparent methods, prompts, outputs, and systematic evaluation, the work could usefully document concrete patterns of misrepresentation in current genAI tools on a politically sensitive historical topic. This would add to the literature on AI and collective memory by providing an empirical baseline for one high-stakes case. However, the absence of any methods, data, or results in the provided manuscript means the claimed findings cannot yet be assessed or generalized.
major comments (3)
- [Abstract] The text states that the authors 'empirically investigate' and 'audit the performance' of three genAI applications for hallucinations and inconsistent moralization, yet supplies no description of the applications tested, the prompts used, the sample size or selection of queries, the coding scheme for misrepresentation, or any observed outputs/results. Without these elements it is impossible to evaluate whether the audit supports the central claim that the risk is 'substantive'.
- [Introduction / Discussion] The manuscript generalizes from a single national-historical case (Ukrainian WWII atrocities) and three unspecified applications to statements about 'high-risk memories' in general. No sampling rationale, comparison to other architectures/training regimes, or discussion of why these three tools suffice for representativeness is provided, so the observed patterns (whatever they may be) cannot be treated as typical of the model class.
- [Discussion] No limitations section or explicit discussion of scope appears; the work therefore offers no basis for assessing external validity, prompt sensitivity, or temporal stability of any findings once models are updated.
minor comments (1)
- [Abstract] The abstract refers to 'selective moralization of history' without defining the term or distinguishing it from other forms of bias; a brief operational definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that greater methodological transparency, clearer scoping of claims, and an explicit limitations discussion are necessary. We have undertaken a major revision to address each point, expanding the methods, qualifying the generalizability of the case study, and adding a dedicated limitations section. Below we respond to the major comments point by point.
Point-by-point responses
Referee: [Abstract] The text states that the authors 'empirically investigate' and 'audit the performance' of three genAI applications for hallucinations and inconsistent moralization, yet supplies no description of the applications tested, the prompts used, the sample size or selection of queries, the coding scheme for misrepresentation, or any observed outputs/results. Without these elements it is impossible to evaluate whether the audit supports the central claim that the risk is 'substantive'.
Authors: We accept this criticism. The original submission did not include sufficient detail on the audit procedure. In the revised manuscript we have added a dedicated methods section that specifies the three applications, the complete prompt set, the sample size and selection criteria for queries (focused on documented WWII atrocities in Ukraine), the coding framework for detecting hallucinations, factual distortions, group misrepresentations, and selective moralization, and representative outputs with our annotations. These materials are also provided in full in a new appendix to enable independent evaluation. revision: yes
Referee: [Introduction / Discussion] The manuscript generalizes from a single national-historical case (Ukrainian WWII atrocities) and three unspecified applications to statements about 'high-risk memories' in general. No sampling rationale, comparison to other architectures/training regimes, or discussion of why these three tools suffice for representativeness is provided, so the observed patterns (whatever they may be) cannot be treated as typical of the model class.
Authors: We agree that the original text did not sufficiently delimit the scope of generalization. The revised introduction now explicitly frames the work as a focused case study of one high-risk memory domain, explains the rationale for selecting the Ukrainian WWII atrocities (scale of events and ongoing political contestation), and justifies the choice of the three consumer-facing applications as leading examples of current generative systems. The discussion has been updated to state that the observed patterns should not be assumed to hold for all architectures, training regimes, or other high-risk memories, and we call for comparative studies. revision: yes
Referee: [Discussion] No limitations section or explicit discussion of scope appears; the work therefore offers no basis for assessing external validity, prompt sensitivity, or temporal stability of any findings once models are updated.
Authors: We acknowledge the absence of an explicit limitations discussion. The revised manuscript now contains a new 'Limitations' section that addresses the study's scope, the sensitivity of outputs to prompt phrasing, the rapid evolution of models that may render findings time-bound, and the restricted external validity beyond the chosen case and tools. This section also notes that the audit provides a snapshot rather than a definitive characterization of all generative AI behavior. revision: yes
Circularity Check
No circularity: empirical audit with no derivations or self-referential reductions
full rationale
The paper conducts a direct empirical audit of three generative AI applications using prompts on Ukrainian WWII atrocities, evaluating outputs for hallucinations, factual distortions, and inconsistent moralization. No equations, parameters, or derivation chains exist that could reduce claims to inputs by construction. The methodology relies on explicit testing and qualitative analysis of AI responses rather than on fitted inputs renamed as predictions or on self-citations that bear the load of the central findings. The assumption that the three applications provide insight into high-risk memory handling is a standard methodological choice for a case study, not a circular self-definition, and the results remain grounded in observable outputs without importing uniqueness theorems or ansatzes from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-risk memories of past atrocities have a strong emotional load and are often instrumentalised by political actors