LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3
The pith
LLARS integrates collaborative prompt engineering, batch generation, and hybrid evaluation into a single platform for domain experts and developers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLARS is an open-source platform that bridges domain experts and developers for LLM systems by integrating Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts × models × data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse methods with live agreement metrics and provenance analysis. New prompts and models are automatically available for batch generation, and completed batches can be turned into evaluation scenarios with a single click. User interviews confirmed the system feels intuitive, saves considerable time by keeping everything in one place, and makes interdisciplinary collaboration seamless.
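As described, the Batch Generation step amounts to running every combination of user-selected prompts, models, and data items under a spending ceiling. A minimal sketch of that idea in Python, assuming hypothetical names (run_batch, estimate_cost, call_model) that are not taken from LLARS's actual code or API:

```python
from itertools import product

def estimate_cost(prompt: str, record: str, price_per_1k_tokens: float) -> float:
    # Crude token estimate (~4 characters per token); a real system would use the provider's tokenizer.
    tokens = (len(prompt) + len(record)) / 4
    return tokens / 1000 * price_per_1k_tokens

def run_batch(prompts, models, data, budget_usd, call_model):
    """Generate one output per prompt x model x data combination,
    stopping before the estimated spend exceeds budget_usd."""
    spent, outputs = 0.0, []
    for prompt, (model, price), record in product(prompts, models.items(), data):
        cost = estimate_cost(prompt, record, price)
        if spent + cost > budget_usd:
            break  # cost control: stop once the ceiling would be exceeded
        outputs.append({
            "prompt": prompt,
            "model": model,
            "input": record,
            "output": call_model(model, prompt, record),  # caller-supplied LLM invocation
        })
        spent += cost
    return outputs, spent
```

Records like these, which keep the prompt, model, and input alongside each output, are the kind of provenance the Hybrid Evaluation module would need when a completed batch is turned into an evaluation scenario.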
What carries the argument
The three tightly connected modules of Collaborative Prompt Engineering, Batch Generation, and Hybrid Evaluation that form an automatic end-to-end pipeline.
If this is right
- Real-time co-authoring of prompts with immediate LLM testing speeds up the engineering phase.
- Configurable batch runs across multiple prompts and models enable systematic comparisons with cost oversight.
- Hybrid evaluation with agreement metrics and provenance tracking helps pinpoint effective model-prompt pairs (one possible agreement metric is sketched after this list).
- Automatic flow from new prompts to generation and from batches to evaluations minimizes manual steps.
- Open-source release supports wider testing and adaptation in other domains.
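The summary does not say which agreement metrics the Hybrid Evaluation module reports; Cohen's kappa between a human and an LLM evaluator is one common choice. A minimal sketch under that assumption (the metric choice and function name are illustrative, not confirmed by the paper):

```python
from collections import Counter

def cohens_kappa(human_labels, llm_labels):
    """Chance-corrected agreement between two raters (e.g., a human and an LLM
    evaluator) judging the same outputs with categorical labels."""
    assert len(human_labels) == len(llm_labels) and human_labels
    n = len(human_labels)
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    h_counts, m_counts = Counter(human_labels), Counter(llm_labels)
    categories = set(human_labels) | set(llm_labels)
    expected = sum(h_counts[c] * m_counts[c] for c in categories) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example: agreement on pass/fail judgments for five generated outputs.
print(cohens_kappa(["pass", "fail", "pass", "pass", "fail"],
                   ["pass", "fail", "fail", "pass", "fail"]))  # ~0.62
```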
Where Pith is reading between the lines
- The platform's design may encourage more structured documentation of prompt development processes.
- It could be adapted for domains requiring high-stakes decisions by adding specialized evaluation criteria.
- Wider use might reveal needs for enhanced security or data privacy features in collaborative settings.
- Integration with existing version control systems beyond the built-in one could further improve team workflows.
Load-bearing premise
The platform's described features are fully implemented and operational in the open-source release, and the positive experiences of nine participants in one specialized domain indicate general applicability and time savings for other users.
What would settle it
Independent groups installing the open-source LLARS and applying it in different fields, then measuring actual time spent and collaboration ease compared to previous tool combinations.
Original abstract
We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts × models × data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LLARS, an open-source platform integrating three modules—Collaborative Prompt Engineering (real-time co-authoring with version control and LLM testing), Batch Generation (configurable output across prompts, models, and data with cost control), and Hybrid Evaluation (joint human-LLM assessment with agreement metrics and provenance)—into an end-to-end pipeline for domain expert and developer collaboration on LLM systems. It reports that interviews with six domain experts and three developers in online counselling confirmed the system feels intuitive, saves time by centralizing workflows, and enables seamless interdisciplinary collaboration.
Significance. If the modules are fully implemented and interoperable, and if the usability claims can be substantiated beyond the current qualitative sample, LLARS could offer a practical contribution to tools supporting LLM application development. The tight integration of prompting, generation, and evaluation with automatic handoff between modules is a potential strength for reducing context-switching in interdisciplinary teams. However, the absence of quantitative metrics, code artifacts, or broader validation limits the assessed impact.
Major comments (2)
- [Abstract] Abstract and reported user study: The central claim that LLARS 'saves considerable time' and 'makes interdisciplinary collaboration seamless' rests on qualitative impressions from nine participants in a single narrow domain (online counselling). No quantitative usage logs, time measurements, comparison baselines, or detailed methodology (e.g., interview protocol, coding scheme, or inter-rater reliability) are provided, making it impossible to evaluate the strength or generalizability of the usability conclusions.
- [System Overview] System description: The paper describes LLARS as a fully functional open-source platform with three interoperable modules and one-click transitions (e.g., completed batches turned into evaluation scenarios), yet provides no repository link, deployment artifacts, usage logs, or verification that the described features (real-time collaboration, cost control, live agreement metrics) are actually implemented and operational.
Minor comments (2)
- [Introduction] The abstract and introduction would benefit from explicit references to related tools (e.g., existing prompt engineering platforms or evaluation frameworks) to clarify the novelty of the integration.
- [Figures] Figure captions and module diagrams should include concrete examples of the data flow between Collaborative Prompt Engineering, Batch Generation, and Hybrid Evaluation to improve clarity for readers unfamiliar with the workflow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for improving the clarity of our claims and the verifiability of the system. We respond to each major comment below and note the revisions we will make.
Point-by-point responses
- Referee: [Abstract] Abstract and reported user study: The central claim that LLARS 'saves considerable time' and 'makes interdisciplinary collaboration seamless' rests on qualitative impressions from nine participants in a single narrow domain (online counselling). No quantitative usage logs, time measurements, comparison baselines, or detailed methodology (e.g., interview protocol, coding scheme, or inter-rater reliability) are provided, making it impossible to evaluate the strength or generalizability of the usability conclusions.
Authors: We agree that the reported user study is qualitative, based on impressions from a small targeted sample in one domain, and does not include quantitative metrics, time logs, or baselines. This was designed as an initial exploratory validation of usability and collaboration benefits rather than a controlled experiment. We will revise the abstract to use more measured language reflecting user-reported impressions. We will also expand the methods description to include the interview protocol, participant recruitment, and analysis approach. We cannot add quantitative data without conducting a new study, but the qualitative results still provide relevant evidence for a systems paper focused on interdisciplinary workflows. revision: partial
- Referee: [System Overview] System description: The paper describes LLARS as a fully functional open-source platform with three interoperable modules and one-click transitions (e.g., completed batches turned into evaluation scenarios), yet provides no repository link, deployment artifacts, usage logs, or verification that the described features (real-time collaboration, cost control, live agreement metrics) are actually implemented and operational.
Authors: We will include the GitHub repository link and basic deployment instructions in the revised version to allow verification of the open-source implementation. The three modules are fully interoperable as described, with features such as real-time co-authoring, cost controls, and live agreement metrics having been implemented and tested during development. We can add supplementary artifacts like example screenshots or configuration details if helpful. Usage logs were not collected, as the evaluation focused on qualitative feedback from the interviews. revision: yes
Circularity Check
No circularity; descriptive system paper with no derivations or predictions
full rationale
The paper presents LLARS as an integrated platform with three modules and reports qualitative feedback from nine interviews in one domain. There are no equations, fitted parameters, predictions, uniqueness theorems, or derivation chains that could reduce to self-definitions, fitted inputs, or self-citations. Claims rest on system description and user impressions, with no load-bearing step that equates output to input by construction. This is consistent with the reader's circularity assessment of 0.0; the work is self-contained as a tool-building and evaluation report.