
arxiv: 2605.28481 · v1 · pith:SF5ORGEKnew · submitted 2026-05-27 · 💻 cs.DL

Co-creation of AI technology, empowering curators of cultural heritage information and guarding research commons

Pith reviewed 2026-06-29 09:27 UTC · model grok-4.3

classification 💻 cs.DL
keywords retrieval-augmented generation · cultural heritage collections · local chatbots · co-creation · research commons · digital archives · information retrieval

The pith

Retrieval-augmented generation produces local chatbots for specific cultural heritage collections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces an engineering sequence that begins with open archives and ends with retrieval-augmented generation used to build local chatbots for individual digital collections of cultural assets. These collections come from institutions in the humanities and social sciences. The approach is presented as a way to let curators interact directly with their data while keeping control over the resources. A sympathetic reader would see the work as showing how AI tools can be adapted to the needs of cultural institutions rather than applied generically.

Core claim

The authors present a sequence of experiments on a data-sharing and archiving platform (Dataverse) that starts from "archives for everyone" and culminates in a local chatbot for specific collections, built with retrieval-augmented generation. They describe this method as the current endpoint of their work on digital collections of cultural assets in the humanities and social sciences.

What carries the argument

Retrieval-augmented generation (RAG) applied to specific digital collections, which first retrieves relevant documents from the collection and then generates responses to user queries.
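The retrieve-then-generate pattern can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the paper gives no implementation details, so the bag-of-words scoring, the record identifiers, the sample records, and the `generate` callback (a stand-in for whatever language model is used) are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical toy collection; real records would come from the repository.
COLLECTION = {
    "rec-1": "Woodcut print of a harbour scene, Amsterdam, 17th century.",
    "rec-2": "Oral history interview about textile workers.",
    "rec-3": "Photograph of a harbour crane, Rotterdam, 1950.",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, collection, k=2):
    """Step 1: return ids of the k records most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in collection.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]

def build_prompt(query, collection, doc_ids):
    """Put the retrieved records in front of the question."""
    context = "\n".join(f"[{d}] {collection[d]}" for d in doc_ids)
    return f"Answer using only the sources below.\n{context}\nQuestion: {query}"

def answer(query, collection, generate, k=2):
    """Step 2: hand the grounded prompt to a text generator (assumed callback)."""
    doc_ids = retrieve(query, collection, k)
    return generate(build_prompt(query, collection, doc_ids)), doc_ids
```

Because the generator only sees records retrieved from the one collection, the answer can be traced back to specific record ids, which is the property the "local control" argument relies on. A production system would typically swap the term-count scoring for embedding-based search.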

If this is right

  • Curators gain direct ways to query and work with their own collections through an AI interface.
  • Research resources stay under local control because the chatbots are built for specific collections.
  • Co-creation between technology developers and cultural institutions produces tools matched to the domain.
  • The same engineering steps can be repeated for other collections in the humanities and social sciences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-augmented approach could be tested on other specialized but non-cultural archives.
  • Accuracy of the generated answers becomes a key practical limit that would need ongoing checks.
  • Wider use might shift how non-specialists first encounter primary cultural sources.
  • Questions of data ownership and update cycles for the underlying collections would need explicit rules.

Load-bearing premise

That retrieval-augmented generation can be effectively tailored to the unique characteristics of cultural heritage collections without introducing significant new problems for curators or research access.

What would settle it

A test in which the local chatbot produces inaccurate responses to queries about the collections or fails to give curators meaningful control over the data and outputs.
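One way such a test could be set up is a small curator-written question set scored against the retriever. The sketch below is an assumption-laden illustration, not something the paper reports: the questions, record ids, keyword index, and the `toy_retrieve` stand-in are all hypothetical, and it checks retrieval only, which is a proxy for answer accuracy rather than a measure of it.

```python
# Hypothetical keyword index standing in for a real retriever.
INDEX = {
    "harbour": ["rec-1", "rec-3"],
    "textile": ["rec-2"],
}

def toy_retrieve(question):
    """Return record ids whose keyword appears in the question (toy stand-in)."""
    text = question.lower()
    ids = []
    for keyword, records in INDEX.items():
        if keyword in text:
            ids.extend(records)
    return ids

def hit_rate(cases, retrieve_fn, k=3):
    """Share of questions whose expected record is among the top-k results.

    `cases` is a list of (question, expected_record_id) pairs written by
    curators. Returns the rate and the list of questions that missed.
    """
    misses = [question for question, expected in cases
              if expected not in retrieve_fn(question)[:k]]
    rate = 1 - len(misses) / len(cases) if cases else 0.0
    return rate, misses

# Hypothetical curator-written test cases.
CASES = [
    ("Which items show a harbour?", "rec-1"),
    ("Who worked in textile mills?", "rec-2"),
    ("Any maps of Utrecht?", "rec-4"),
]

rate, misses = hit_rate(CASES, toy_retrieve)
```

The list of missed questions gives curators something concrete to inspect, which speaks to the control half of the test as well as the accuracy half.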

Original abstract

The substance of this paper is the description of the use of Retrieval-Augmented Generation (RAG) for specific digital collections of cultural assets. The collections are provided by institutions operating in the cultural sector. The topical areas are the humanities and social sciences. More concretely, most of the work presented here was enabled by a European-funded research project MuseIT which is clearly situated in the realm of fostering new technologies for Cultural Heritage. We adhere to this interaction by presenting a sequence of our experimentations. This sequence is narrated as a specific journey of engineering all executed around a specific data-sharing and archiving platform Dataverse. Implementing a local chatbot for collections - a method also known as RAG in Information Retrieval - is the current culmination of this journey. The engineering journey we describe in the core of the paper starts from "archives for everyone" and ends with "local chatbots for specific collections".

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper describes a sequence of engineering experiments conducted on the Dataverse platform as part of the MuseIT project. It begins with efforts to create 'archives for everyone' and culminates in the implementation of a local Retrieval-Augmented Generation (RAG) chatbot for specific cultural heritage collections in the humanities and social sciences, framing this as co-creation of AI technology to empower curators and guard research commons.

Significance. The manuscript offers a descriptive case study of applying RAG to cultural heritage data on an established repository platform. If accompanied by evaluation data demonstrating curator empowerment and commons protection, it could serve as a practical reference for similar digital-library projects; in its current form, the lack of any performance metrics or user studies substantially reduces its contribution to the literature.

Major comments (2)
  1. [Abstract] Abstract and core narrative: the title and framing assert that the RAG implementation empowers curators and guards research commons, yet the text contains no quantitative results, error analysis, user studies, or other evaluation data to support these outcomes.
  2. [Core narrative] Core of the paper (description of the engineering sequence): no details are provided on how the RAG pipeline was adapted to the distinctive characteristics of cultural-heritage collections (e.g., metadata heterogeneity, multilingual content, or access restrictions), leaving the central engineering claim unsupported.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review of our manuscript. We address each major comment below and indicate the revisions we will make to improve clarity and support for the described work.

Point-by-point responses
  1. Referee: [Abstract] Abstract and core narrative: the title and framing assert that the RAG implementation empowers curators and guards research commons, yet the text contains no quantitative results, error analysis, user studies, or other evaluation data to support these outcomes.

    Authors: The manuscript is structured as a descriptive case study of an engineering sequence within the MuseIT project, documenting the progression from general archiving efforts to local RAG implementations on Dataverse. The title and framing articulate the project's co-creation goals rather than reporting measured outcomes. We acknowledge that this distinction could be made more explicit to prevent any implication of empirical validation. We will revise the abstract and core narrative to emphasize the descriptive scope and add a dedicated section outlining plans for future evaluations, including potential user studies with curators. revision: yes

  2. Referee: [Core narrative] Core of the paper (description of the engineering sequence): no details are provided on how the RAG pipeline was adapted to the distinctive characteristics of cultural-heritage collections (e.g., metadata heterogeneity, multilingual content, or access restrictions), leaving the central engineering claim unsupported.

    Authors: The core narrative focuses on the high-level engineering journey and Dataverse integration for the specified humanities and social sciences collections. While standard RAG components were applied, explicit discussion of adaptations for metadata heterogeneity, multilingual content, or access restrictions is indeed limited in the current text. We will expand the core sections to describe any collection-specific preprocessing, metadata handling, or language accommodations that were implemented during the experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents a narrative description of an engineering sequence on the Dataverse platform, progressing from 'archives for everyone' to a RAG-based local chatbot for cultural-heritage collections within the MuseIT project. No equations, derivations, predictions, fitted parameters, or formal claims are advanced. There are no load-bearing steps that reduce by construction to inputs, self-citations, or ansatzes. The content is purely descriptive with no mathematical or predictive chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper is a descriptive engineering narrative with no mathematical content, fitted parameters, formal axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5701 in / 905 out tokens · 28383 ms · 2026-06-29T09:27:51.498867+00:00 · methodology

