Co-creation of AI technology, empowering curators of cultural heritage information and guarding research commons

Andrea Scharnhorst; Han Yang; Jetze Touber; Kim Ferguson; Philipp Mayr; Vyacheslav Tykhonov

arxiv: 2605.28481 · v1 · pith:SF5ORGEKnew · submitted 2026-05-27 · 💻 cs.DL

Pith reviewed 2026-06-29 09:27 UTC · model grok-4.3

classification 💻 cs.DL

keywords retrieval-augmented generation · cultural heritage collections · local chatbots · co-creation · research commons · digital archives · information retrieval

The pith

Retrieval-augmented generation produces local chatbots for specific cultural heritage collections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces an engineering sequence that begins with open archives and ends with retrieval-augmented generation used to build local chatbots for individual digital collections of cultural assets. These collections come from institutions in the humanities and social sciences. The approach is presented as a way to let curators interact directly with their data while keeping control over the resources. A sympathetic reader would see the work as showing how AI tools can be adapted to the needs of cultural institutions rather than applied generically.

Core claim

The authors present a sequence of experiments on a data-sharing and archiving platform. The sequence starts from "archives for everyone" and culminates in a local chatbot for specific collections, built with retrieval-augmented generation. They describe this method as the current endpoint of their work on digital collections of cultural assets in the humanities and social sciences.

What carries the argument

Retrieval-augmented generation (RAG) applied to specific digital collections, which first retrieves relevant documents from the collection and then generates responses to user queries.
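
The retrieve-then-generate pattern named above can be illustrated with a minimal sketch. The paper does not say which retriever, embedding model, or language model the authors used, so everything here is an assumption: the record ids and texts are invented, a bag-of-words cosine similarity stands in for whatever retrieval the real system uses, and the generation step is reduced to assembling the prompt that a locally hosted model would receive.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, records, k=2):
    """Return up to k (record_id, text) pairs ranked by similarity to the query."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(text))), record_id, text)
        for record_id, text in records.items()
    ]
    # Highest score first; ties broken by record id for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(rid, text) for score, rid, text in scored[:k] if score > 0]


def build_prompt(query, passages):
    """Assemble the grounded prompt that would be sent to a language model."""
    context = "\n".join(f"[{rid}] {text}" for rid, text in passages)
    return (
        "Answer using only the records below and cite their ids.\n"
        f"{context}\n"
        f"Question: {query}"
    )


# Hypothetical collection records, invented for illustration only.
records = {
    "rec-001": "Photograph of a seventeenth century Dutch merchant ship model",
    "rec-002": "Oral history interview about textile workers in Sweden",
    "rec-003": "Letter describing a merchant voyage to the Baltic",
}

passages = retrieve("merchant ship", records)
prompt = build_prompt("merchant ship", passages)
print(prompt)
```

Because the model only sees the passages placed in the prompt, the collection itself can stay on local infrastructure, which is the property the paper ties to curator control.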

If this is right

Curators gain direct ways to query and work with their own collections through an AI interface.
Research resources stay under local control because the chatbots are built for specific collections.
Co-creation between technology developers and cultural institutions produces tools matched to the domain.
The same engineering steps can be repeated for other collections in the humanities and social sciences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-augmented approach could be tested on other specialized but non-cultural archives.
Accuracy of the generated answers becomes a key practical limit that would need ongoing checks.
Wider use might shift how non-specialists first encounter primary cultural sources.
Questions of data ownership and update cycles for the underlying collections would need explicit rules.

Load-bearing premise

That retrieval-augmented generation can be effectively tailored to the unique characteristics of cultural heritage collections without introducing significant new problems for curators or research access.

What would settle it

A test in which the local chatbot produces inaccurate responses to queries about the collections or fails to give curators meaningful control over the data and outputs.
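
One way to make such a test concrete is a small set of curator-written questions, each paired with the record that should be retrieved, plus a check that answers cite only retrieved records. The sketch below is an assumption about how this could be set up, not something the paper reports: the function names, the record ids, and the toy keyword retriever are all hypothetical.

```python
def hit_rate(cases, retrieve_fn, k=3):
    """Fraction of cases whose expected record id is in the top-k results."""
    if not cases:
        return 0.0
    hits = sum(
        1 for question, expected_id in cases
        if expected_id in retrieve_fn(question)[:k]
    )
    return hits / len(cases)


def unsupported_citations(answer_ids, retrieved_ids):
    """Record ids cited in an answer that were never retrieved."""
    return sorted(set(answer_ids) - set(retrieved_ids))


# Hypothetical keyword index standing in for the real retriever.
INDEX = {
    "rec-001": {"dutch", "ship", "model"},
    "rec-002": {"textile", "interview", "sweden"},
}


def toy_retrieve(question):
    """Rank record ids by the number of keywords shared with the question."""
    words = set(question.lower().split())
    matches = [rid for rid in INDEX if words & INDEX[rid]]
    return sorted(matches, key=lambda rid: (-len(words & INDEX[rid]), rid))


# Curator-style test cases: (question, record id that should be found).
cases = [
    ("dutch ship model", "rec-001"),
    ("textile interview", "rec-002"),
    ("baltic voyage letter", "rec-003"),  # not in the index, so a miss
]

print(hit_rate(cases, toy_retrieve))
print(unsupported_citations(["rec-001", "rec-009"], ["rec-001", "rec-002"]))
```

A low hit rate or a non-empty list of unsupported citations would be the kind of evidence that counts against the premise.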

Original abstract

The substance of this paper is the description of the use of Retrieval-Augmented Generation (RAG) for specific digital collections of cultural assets. The collections are provided by institutions operating in the cultural sector. The topical areas are the humanities and social sciences. More concretely, most of the work presented here was enabled by a European-funded research project MuseIT which is clearly situated in the realm of fostering new technologies for Cultural Heritage. We adhere to this interaction by presenting a sequence of our experimentations. This sequence is narrated as a specific journey of engineering all executed around a specific data-sharing and archiving platform Dataverse. Implementing a local chatbot for collections - a method also known as RAG in Information Retrieval - is the current culmination of this journey. The engineering journey we describe in the core of the paper starts from "archives for everyone" and ends with "local chatbots for specific collections".

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a project report narrating the steps to a RAG chatbot on Dataverse for cultural heritage collections, with no evaluation or new technical claims.

The paper walks through an engineering sequence in the MuseIT project that starts with open archives and ends with a local RAG-based chatbot for specific cultural heritage collections on the Dataverse platform. That is the core content.

They do a decent job laying out the practical steps taken around data sharing and archiving tools. The description stays grounded in the actual collections from humanities and social sciences institutions and shows how standard RAG was fitted to that setting. Readers who have worked with similar platforms may recognize the progression and pick up a few implementation details.

The main limitation is the complete absence of any results. There are no performance numbers, no error analysis, no curator feedback, and no comparison to other approaches. The statements about empowering curators and protecting research commons remain assertions without supporting evidence. Because the paper offers only the narrative of the journey, it is hard to judge whether the final chatbot actually works as intended or introduces new problems.

This kind of write-up is mainly useful to practitioners in digital heritage or library technology who want to see how one group moved from general archives to a tailored chatbot. It is not aimed at researchers looking for new methods or validated findings.

I would not send it for peer review. It reads as an honest project description rather than a contribution that needs referee scrutiny.

Referee Report

2 major / 0 minor

Summary. The paper describes a sequence of engineering experiments conducted on the Dataverse platform as part of the MuseIT project. It begins with efforts to create 'archives for everyone' and culminates in the implementation of a local Retrieval-Augmented Generation (RAG) chatbot for specific cultural heritage collections in the humanities and social sciences, framing this as co-creation of AI technology to empower curators and guard research commons.

Significance. The manuscript offers a descriptive case study of applying RAG to cultural heritage data on an established repository platform. If accompanied by evaluation data demonstrating curator empowerment and commons protection, it could serve as a practical reference for similar digital-library projects; in its current form, the lack of any performance metrics or user studies substantially reduces its contribution to the literature.

major comments (2)

[Abstract] Abstract and core narrative: the title and framing assert that the RAG implementation empowers curators and guards research commons, yet the text contains no quantitative results, error analysis, user studies, or other evaluation data to support these outcomes.
[Core narrative] Core of the paper (description of the engineering sequence): no details are provided on how the RAG pipeline was adapted to the distinctive characteristics of cultural-heritage collections (e.g., metadata heterogeneity, multilingual content, or access restrictions), leaving the central engineering claim unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review of our manuscript. We address each major comment below and indicate the revisions we will make to improve clarity and support for the described work.

Referee: [Abstract] Abstract and core narrative: the title and framing assert that the RAG implementation empowers curators and guards research commons, yet the text contains no quantitative results, error analysis, user studies, or other evaluation data to support these outcomes.

Authors: The manuscript is structured as a descriptive case study of an engineering sequence within the MuseIT project, documenting the progression from general archiving efforts to local RAG implementations on Dataverse. The title and framing articulate the project's co-creation goals rather than reporting measured outcomes. We acknowledge that this distinction could be made more explicit to prevent any implication of empirical validation. We will revise the abstract and core narrative to emphasize the descriptive scope and add a dedicated section outlining plans for future evaluations, including potential user studies with curators.

revision: yes

Referee: [Core narrative] Core of the paper (description of the engineering sequence): no details are provided on how the RAG pipeline was adapted to the distinctive characteristics of cultural-heritage collections (e.g., metadata heterogeneity, multilingual content, or access restrictions), leaving the central engineering claim unsupported.

Authors: The core narrative focuses on the high-level engineering journey and Dataverse integration for the specified humanities and social sciences collections. While standard RAG components were applied, explicit discussion of adaptations for metadata heterogeneity, multilingual content, or access restrictions is indeed limited in the current text. We will expand the core sections to describe any collection-specific preprocessing, metadata handling, or language accommodations that were implemented during the experiments.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents a narrative description of an engineering sequence on the Dataverse platform, progressing from 'archives for everyone' to a RAG-based local chatbot for cultural-heritage collections within the MuseIT project. No equations, derivations, predictions, fitted parameters, or formal claims are advanced. There are no load-bearing steps that reduce by construction to inputs, self-citations, or ansatzes. The content is purely descriptive with no mathematical or predictive chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a descriptive engineering narrative with no mathematical content, fitted parameters, formal axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5701 in / 905 out tokens · 28383 ms · 2026-06-29T09:27:51.498867+00:00 · methodology

