LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

Alessio Buscemi; Alfredo Capozucca; Barbara Delacroix; German Castignani; Tom Lucas

arxiv: 2605.31167 · v1 · pith:FE4DBYVUnew · submitted 2026-05-29 · 💻 cs.AI


Pith reviewed 2026-06-28 22:33 UTC · model grok-4.3

keywords: LLM auditing · transparency mechanisms · privacy-preserving evaluation · hallucination detection · multi-judge consensus · RAG triad metrics · epistemic uncertainty · self-hosted framework

The pith

LLM-FACETS lets non-technical users audit LLM outputs for factuality and uncertainty with all deterministic metrics kept inside a self-hosted server.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LLM-FACETS as an open-source framework that removes programming barriers and external data transmission from LLM auditing. It organizes the interface around three practitioner profiles that match categories in the EU AI Act and NIST frameworks. Deterministic metrics such as BLEU, ROUGE, and BERTScore run entirely locally with no outbound transmission, while LLM-judge metrics require explicit user credential control for external APIs. Three specific mechanisms operationalize transparency: token-level log-probability views for uncertainty, multi-judge consensus to reduce bias, and RAG Triad scores to locate hallucinations. The plugin design allows new metrics or datasets to be added without altering the core pipeline, and the implementation is checked through cross-validation of 18 metrics against reference libraries.
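A token-level log-probability view of the kind described above can be sketched in a few lines. This is an illustration only: the function names and the two thresholds are assumptions made for the example, not values taken from the paper, and the input is assumed to be a list of (token, natural-log probability) pairs such as those returned by APIs that expose log-probabilities.

```python
import math

# Hypothetical cut-offs for the example; the paper's thresholds are not stated here.
HIGH, MEDIUM = 0.9, 0.5


def confidence_bucket(logprob: float) -> str:
    """Map a natural-log token probability to a display bucket."""
    p = math.exp(logprob)
    if p >= HIGH:
        return "high"
    if p >= MEDIUM:
        return "medium"
    return "low"


def annotate(tokens):
    """Turn (text, logprob) pairs into (text, probability, bucket) triples."""
    return [(text, math.exp(lp), confidence_bucket(lp)) for text, lp in tokens]


example = [("Paris", -0.01), (" is", -0.4), (" maybe", -2.3)]
for text, p, bucket in annotate(example):
    print(f"{text!r}: p={p:.2f} ({bucket})")
```

A front end could then color each token by its bucket; which colors and cut-offs are appropriate is a design decision that the sketch does not settle.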

Core claim

LLM-FACETS operationalizes transparency through three mechanisms: token-level log-probability visualization for epistemic uncertainty, multi-judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations, all inside an architecture where deterministic metrics execute entirely within the self-hosted server with no outbound transmission and LLM-judge metrics contact external APIs only under explicit user credential control.
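To make the multi-judge idea concrete, the following is a minimal sketch of one possible aggregation. It assumes each judge returns a score in [0, 1]; the median aggregation, the `max_spread` threshold, and the function name `jury_consensus` are all assumptions for illustration, since this summary does not say how the framework combines judge scores.

```python
from statistics import median


def jury_consensus(scores: dict[str, float], max_spread: float = 0.3) -> dict:
    """Aggregate per-judge scores in [0, 1] and flag strong disagreement.

    The median limits the influence of a single outlying judge; the spread
    (max minus min) is reported so a reviewer can see when judges diverge.
    """
    if not scores:
        raise ValueError("at least one judge score is required")
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "consensus": median(values),
        "spread": spread,
        "disagreement": spread > max_spread,
    }


result = jury_consensus({"judge_a": 0.8, "judge_b": 0.7, "judge_c": 0.2})
print(result)
```

Reporting the spread alongside the consensus keeps a low-scoring dissenting judge visible instead of averaging it away.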

What carries the argument

The plugin architecture combined with explicit separation of data flows between fully local deterministic metrics and user-controlled external LLM-judge metrics.
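The separation described here can be pictured as a metric registry in which every plugin declares whether it leaves the host. The sketch below is hypothetical throughout: `MetricPlugin`, `register`, `run_metric`, and the `external` flag are names invented for the example and are not the framework's actual API (the paper describes a Next.js server, whereas this sketch uses Python for brevity).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MetricPlugin:
    name: str
    compute: Callable[..., float]
    external: bool = False  # True if the metric calls an outside API


REGISTRY: dict[str, MetricPlugin] = {}


def register(plugin: MetricPlugin) -> None:
    """Add a metric without touching the evaluation code below."""
    if plugin.name in REGISTRY:
        raise ValueError(f"metric {plugin.name!r} already registered")
    REGISTRY[plugin.name] = plugin


def run_metric(name: str, candidate: str, reference: str,
               api_key: Optional[str] = None) -> float:
    """Run a metric; external metrics run only when the user supplies a key."""
    plugin = REGISTRY[name]
    if plugin.external:
        if not api_key:
            raise PermissionError(f"{name!r} contacts an external API; supply a key to opt in")
        return plugin.compute(candidate, reference, api_key=api_key)
    return plugin.compute(candidate, reference)


def exact_match(candidate: str, reference: str) -> float:
    """Local deterministic metric: case-insensitive exact match."""
    return float(candidate.strip().lower() == reference.strip().lower())


def fake_judge(candidate: str, reference: str, api_key: str) -> float:
    """Placeholder for an LLM judge; a real one would call a provider here."""
    return 1.0


register(MetricPlugin("exact_match", exact_match))
register(MetricPlugin("fake_judge", fake_judge, external=True))
print(run_metric("exact_match", "Paris ", "paris"))
```

Declaring the data-flow class on the plugin itself means the pipeline can refuse an external call by default, which is one way to make the opt-in explicit.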

If this is right

Cross-checking multiple metrics that target the same property becomes possible without changing the evaluation pipeline.
Accountability for model outputs is separated from the teams that build the models.
New metrics or datasets can be added through plugins without modifying the core system.
Compliance officers and domain experts gain direct access to evaluation results that match regulatory stakeholder categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar local-first evaluation designs could be applied to other generative systems once equivalent local metrics exist for those systems.
Organizations could reduce compliance risks by adopting the explicit data-flow controls when auditing models on regulated data.
Contributors could build domain-specific plugins to extend hallucination localization to specialized fields such as medical or legal text.

Load-bearing premise

The self-hosted server and browser interface can be implemented and used by non-technical practitioners without creating new data leakage paths or demanding complex configuration.

What would settle it

A test showing that running the deterministic metrics requires sending data to external services or that non-technical users need programming skills to operate the interface would falsify the privacy and accessibility claims.
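One form such a test could take, as a minimal in-process sketch: replace socket creation with a function that raises, then run a deterministic metric inside the guard. Everything here is an assumption for illustration. `no_network` and `unigram_f1` are invented names, the metric is a toy stand-in and not a reference ROUGE implementation, and the guard only catches sockets opened through Python's `socket` module in the same process, so it does not replace a packet capture on the real server.

```python
import socket
from collections import Counter
from contextlib import contextmanager


class OutboundCallAttempted(RuntimeError):
    """Raised when guarded code tries to open a socket."""


@contextmanager
def no_network():
    """Make any attempt to create a socket fail loudly inside the block."""
    original = socket.socket

    def guarded(*args, **kwargs):
        raise OutboundCallAttempted("metric tried to open a socket")

    socket.socket = guarded
    try:
        yield
    finally:
        socket.socket = original  # always restore the real constructor


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy deterministic overlap metric: F1 over whitespace-split unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


with no_network():
    score = unigram_f1("the cat sat", "the cat sat down")
print(f"{score:.3f}")
```

If a metric under test opened a connection, the guard would raise `OutboundCallAttempted` instead of letting the call proceed, turning the privacy claim into something a test suite can fail on.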

Figures

Figures reproduced from arXiv: 2605.31167 by Alessio Buscemi, Alfredo Capozucca, Barbara Delacroix, German Castignani, Tom Lucas.

**Figure 2.** Configuration UI for defining practitioner profiles and metric strategies.

**Figure 3.** Architecture of LLM-FACETS. Blue zone: browser client (IndexedDB stores API keys and results locally). Green zone: self-hosted Next.js server process; all computation stays within user infrastructure, and BLEU, ROUGE, METEOR, and BERTScore produce no outbound calls. Orange zone: external LLM APIs, contacted only for LLM-based metrics (Jury, RAG Triad, G-Eval, LogProbs); data exposure is mitigated by the anonymi…

**Figure 4.** Benchmark analysis dashboard (Overview tab): radar chart aggregating nine primary metrics, box plots showing per-metric …

**Figure 5.** Main dashboard showing the metrics selection grid. Users can navigate between categories (Traditional, Neural, LLM-as-a …

**Figure 6.** Dataset explorer interface showing available datasets (SQuAD v2, PsiloQA), download status, split selection, and live row …

**Figure 7.** Jury module: three judges from different provider families (OpenAI, Google, Alibaba) independently score the same input …

**Figure 8.** RAG Triad evaluation interface showing a complete example: the user-provided question, retrieved context, and generated …

**Figure 9.** Token-level log-probability visualization: a generated response with per-token confidence color coding (green: high, yellow: …

Original abstract

Assessing whether Large Language Model outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a prerequisite for responsible AI deployment. Yet auditing LLMs remains inaccessible to non-technical practitioners: existing tools require programming expertise and non-trivial environment setup, and cloud-hosted platforms transmit evaluation data to external services, creating barriers for domain experts and compliance officers legally responsible for AI oversight. We introduce LLM-FACETS (LLM FActuality Cross-EvaluaTion System): an open-source framework with a browser-accessible interface and a plugin architecture, structured around three practitioner profiles (technical experts, domain experts, compliance officers) that mirror the stakeholder categories identified in the EU AI Act and the NIST AI Risk Management Framework. The architecture makes data flows explicit: deterministic metrics (BLEU, ROUGE, BERTScore) run entirely within the self-hosted server with no outbound transmission; LLM-judge metrics contact external APIs explicitly, with users retaining full credential control. The framework operationalizes transparency through three mechanisms: token-level log-probability visualization for epistemic uncertainty, multi-judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations. A plugin architecture allows any new metric or dataset to be integrated without modifying the evaluation pipeline. The open-source implementation enables cross-checking across multiple metrics targeting the same property, ensuring reproducibility and decoupling AI accountability from the teams building the systems assessed. We verify the framework through cross-validation of 18 metric implementations against canonical reference libraries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLM-FACETS packages standard metrics into a self-hosted framework with explicit data flows and stakeholder profiles, but supplies no evidence that the privacy isolation or non-technical accessibility actually holds.

The paper's core offering is an open-source evaluation system that runs deterministic metrics like BLEU and BERTScore inside a self-hosted server with no outbound calls, while routing LLM-judge calls through user-controlled credentials. It adds a browser UI, a plugin system for new metrics, and three user profiles drawn from EU AI Act and NIST categories. Token-level log-prob visuals, multi-judge consensus, and RAG Triad checks are included as transparency mechanisms.

The cross-validation of 18 metric implementations against reference libraries is a straightforward and useful check on numerical correctness. Making the data-flow distinctions explicit is also a clear step forward from cloud-only tools.

The gaps are straightforward. The privacy and accessibility claims rest entirely on the architecture description; the only reported verification is metric agreement, with nothing shown about network isolation, plugin loader behavior, credential handling, or actual deployment steps. If the server requires Docker, port setup, or Python environment work, the claim that compliance officers can use it without technical help is unsupported. The individual components (RAG metrics, multi-judge) are drawn from prior work, so the contribution is the integration rather than new methods.

This is aimed at teams that need an off-the-shelf audit tool rather than researchers developing new evaluation theory. It deserves a referee if the full code and deployment instructions are provided, because the explicit separation of metric types is worth checking in practice. Otherwise the central operational claims remain untested.

Referee Report

2 major / 0 minor

Summary. The paper introduces LLM-FACETS, an open-source framework with a browser-accessible interface and plugin architecture for evaluating LLM outputs on factuality, epistemic calibration, and reproducibility. It targets three stakeholder profiles aligned with EU AI Act and NIST guidelines, makes data flows explicit (deterministic metrics like BLEU/ROUGE/BERTScore run locally with no outbound transmission; LLM-judge metrics require explicit user credentials), and operationalizes transparency via token-level log-probability visualization, multi-judge consensus, and RAG Triad metrics. The framework is verified solely through cross-validation of 18 metric implementations against reference libraries.

Significance. If the architecture claims hold, the work could meaningfully lower barriers for non-technical compliance officers and domain experts to perform reproducible LLM audits without external data transmission, directly supporting regulatory requirements for accountability. The explicit separation of metric types and plugin extensibility are practical strengths that could enable independent cross-checking.

major comments (2)

[Abstract] The verification statement reports only cross-validation of 18 metric implementations against canonical reference libraries. This addresses numerical agreement for deterministic metrics but supplies no evidence on network isolation, absence of unlisted HTTP calls in metric modules or the plugin loader, credential handling for LLM-judge paths, or the actual deployment steps required to run the self-hosted server and browser UI. These untested elements are load-bearing for the central privacy-preservation and accessibility claims.
[Abstract, architecture description] The claim that deterministic metrics execute with zero outbound transmission inside a self-hosted server while remaining usable by non-technical practitioners is presented as an architectural property but is not accompanied by any test results, deployment logs, or setup instructions that would confirm the absence of new leakage paths or non-trivial configuration requirements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support of the privacy and deployment claims. We address each major comment below and will revise the manuscript accordingly to include additional verification details.

Referee: [Abstract] The verification statement reports only cross-validation of 18 metric implementations against canonical reference libraries. This addresses numerical agreement for deterministic metrics but supplies no evidence on network isolation, absence of unlisted HTTP calls in metric modules or the plugin loader, credential handling for LLM-judge paths, or the actual deployment steps required to run the self-hosted server and browser UI. These untested elements are load-bearing for the central privacy-preservation and accessibility claims.

Authors: We agree that the current verification is limited to metric cross-validation and does not empirically demonstrate network isolation or deployment properties. The architecture description states that deterministic metrics run locally with no outbound transmission and that LLM-judge paths require explicit user credentials, but these are presented as design properties without supporting tests. In revision we will add a dedicated verification subsection with: (1) network traffic captures during execution of deterministic metrics, (2) static code analysis results confirming absence of unlisted HTTP endpoints in metric and plugin modules, (3) credential-handling audit for LLM-judge paths, and (4) minimal deployment logs and setup instructions for the self-hosted server and browser UI. These additions will directly address the load-bearing claims.

Revision: yes

Referee: [Abstract, architecture description] The claim that deterministic metrics execute with zero outbound transmission inside a self-hosted server while remaining usable by non-technical practitioners is presented as an architectural property but is not accompanied by any test results, deployment logs, or setup instructions that would confirm the absence of new leakage paths or non-trivial configuration requirements.

Authors: We acknowledge that the manuscript presents the zero-outbound-transmission property as an architectural guarantee without accompanying empirical evidence or deployment artifacts. While the design isolates deterministic metrics inside the self-hosted server and requires explicit user action for external API calls, the absence of test results and setup instructions is a genuine gap. We will revise the manuscript to include network-isolation test results, sample deployment logs showing no external traffic, and concise setup instructions that demonstrate usability for non-technical practitioners without introducing new configuration leakage paths.

Revision: yes

Circularity Check

0 steps flagged

No circularity: framework description with external metric validation only

The paper introduces LLM-FACETS as an open-source framework with browser interface, plugin architecture, and explicit data flows for privacy. It operationalizes transparency via token-level visualization, multi-judge consensus, and RAG Triad metrics, then verifies via cross-validation of 18 metric implementations against canonical reference libraries. No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. The central claims rest on architectural description and external library comparisons rather than any reduction of outputs to inputs by construction. This is a standard non-circular framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract, which focuses on software architecture and metric integration rather than theoretical constructs or fitted quantities.

Reference graph

Works this paper leans on

38 extracted references · 19 canonical work pages · 1 internal anchor

[1]

Apache Software Foundation. 2013. Apache Parquet: Columnar Storage Format. https://parquet.apache.org

2013
[2]

Arize AI. 2024. Phoenix: Open-Source AI Observability Platform. [software]. https://phoenix.arize.com

2024
[3]

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, USA, 65–72

2005
[4]


Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT). ACM, Virtual Event, Canada, 610–623. doi:10.1145/3442188.3445922

work page doi:10.1145/3442188.3445922 2021
[5]

Confident AI. 2024. DeepEval: Open-Source LLM Evaluation Framework. Version 1.0 [software]. https://github.com/confident-ai/deepeval

2024
[6]

Credo AI. 2024. Credo AI: AI Governance Platform. https://www.credo.ai

2024
[7]


Upol Ehsan, Q. Vera Liao, Michael Muller, Mark O. Riedl, and Justin D. Weisz. 2021. Expanding Explainability: Towards Social Transparency in AI Systems. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama, Japan, 1–19. doi:10.1145/3411764.3445188

work page doi:10.1145/3411764.3445188 2021
[8]

Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2024. RAGAS: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Association for Computational Linguistics, Malta, 150–158. doi:10.18653/v1/2024.eacl-demo.16

work page doi:10.18653/v1/2024.eacl-demo.16 2024
[9]

European Parliament and Council of the European Union. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council on the Protection of Natural Persons with Regard to the Processing of Personal Data (General Data Protection Regulation). Official Journal of the European Union L 119/1. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CEL...

2016
[10]

European Parliament and Council of the European Union. 2024. Regulation (EU) 2024/1689 of the European Parliament and of the Council – Artificial Intelligence Act. Official Journal of the European Union. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689

2024
[11]

Fiddler AI. 2024. Fiddler AI: Model Performance Management Platform. https://www.fiddler.ai

2024
[12]

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2023. Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text. Journal of Artificial Intelligence Research 77 (2023), 103–166. doi:10.1613/jair.1.13715

work page doi:10.1613/jair.1.13715 2023
[13]

Google DeepMind. 2024. Gemini: A Family of Highly Capable Multimodal Models. https://deepmind.google/technologies/gemini/

2024
[14]

International Organization for Standardization. 2023. ISO/IEC 42001:2023 — Information Technology — Artificial Intelligence — Management System. https://www.iso.org/standard/81230.html

2023
[15]

Minsuk Kahng, Ian Tenney, Mark Neumann, Jaime Wexler, Fernanda Viégas, and Martin Wattenberg. 2024. LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, Hawaii, USA, 1–7. doi:10.1145/3613905.3650755

work page doi:10.1145/3613905.3650755 2024
[16]

Spencer Kelly. 2016. compromise: Modest Natural Language Processing for JavaScript. [software]. https://github.com/spencermountain/compromise

2016
[17]

LangChain, Inc. 2023. LangSmith: Platform for Building Production-Grade LLM Applications. https://smith.langchain.com

2023
[18]

Langfuse. 2024. Langfuse: Open Source LLM Engineering Platform. [software]. https://langfuse.com

2024
[19]

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 6449–6464. doi:10.18653/v1/2023.emnlp-main.397

work page doi:10.18653/v1/2023.emnlp-main.397 2023
[20]


Q. Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: Informing Design Practices for Explainable AI User Experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, Hawaii, USA, 1–15. doi:10.1145/3313831.3376590

work page doi:10.1145/3313831.3376590 2020
[21]

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013

2004
[22]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Singapore, 2511–2522. doi:10.18653/v1/2023.emnlp-main.153

work page doi:10.18653/v1/2023.emnlp-main.153 2023
[23]

Microsoft Corporation. 2018. Presidio: Context-Aware, Pluggable and Customizable Data Protection and Anonymization Service for Text and Images. [software]. https://microsoft.github.io/presidio/

2018
[24]


National Institute of Standards and Technology. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0). Technical Report NIST AI 100-1. National Institute of Standards and Technology, Gaithersburg, Maryland, USA. doi:10.6028/NIST.AI.100-1

work page doi:10.6028/nist.ai.100-1 2023
[25]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. doi:10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002
[26]

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation (WMT). Association for Computational Linguistics, Brussels, Belgium, 186–191. doi:10.18653/v1/W18-6319

work page doi:10.18653/v1/w18-6319 2018
[27]

Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: An Embeddable Analytical Database. In Proceedings of the 2019 ACM SIGMOD International Conference on Management of Data. ACM, Amsterdam, The Netherlands, 1981–1984. doi:10.1145/3299869.3320212

work page doi:10.1145/3299869.3320212 2019
[28]

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, Melbourne, Australia, 784–789. doi:10.18653/v1/P18-2124

work page doi:10.18653/v1/p18-2124 2018
[29]

Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, and Julia Belikova. 2025. When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA. HuggingFace Datasets. https://huggingface.co/datasets/s-nlp/PsiloQA. doi:10.48550/arXiv.2510.04849

work page doi:10.48550/arxiv.2510.04849 2025
[30]

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics, M...

work page doi:10.18653/v1/2024.naacl-long.20 2024
[31]

TruEra, Inc. 2024. TruLens: Evaluation and Tracking for LLM Experiments. [software]. https://github.com/truera/trulens

2024
[32]

Igor Tufanov, Karen Hambardzumyan, Javier Ferrando, and Elena Voita. 2024. LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 3: System Demonstrations. Association for Computational Linguistics, Bangkok, Thailand, 29–41. d...

work page doi:10.18653/v1/2024.acl-demos.6 2024
[33]

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv preprint arXiv:2404.18796. doi:10.48550/arXiv.2404.18796

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.18796 2024
[34]

Xenova. 2024. Transformers.js: State-of-the-Art Machine Learning for the Web. Version 2.17.2 [software]. https://github.com/xenova/transformers.js

2024
[35]

Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do Large Language Models Know What They Don’t Know? In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 8653–8665. doi:10.18653/v1/2023.findings-acl.551

work page doi:10.18653/v1/2023.findings-acl.551 2023
[36]


Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR). OpenReview.net, Addis Ababa, Ethiopia, 1–15. https://openreview.net/forum?id=SkeHuCVFDr

2020
[37]

Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, and Frederic Sala. 2026. CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation. arXiv preprint arXiv:2603.00039. doi:10.48550/arXiv.2603.00039

work page doi:10.48550/arxiv.2603.00039 2026
[38]


Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., New Orleans, Louisiana, USA...

2023

[1] [1]

Apache Software Foundation. 2013. Apache Parquet: Columnar Storage Format. https://parquet.apache.org

2013

[2] [2]

Arize AI. 2024. Phoenix: Open-Source AI Observability Platform. [software]. https://phoenix.arize.com

2024

[3] [3]

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, USA, 65–72

2005

[4] [4]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT). ACM, Virtual Event, Canada, 610–623. doi:10.1145/3442188.3445922

work page doi:10.1145/3442188.3445922 2021

[5] [5]

Confident AI. 2024. DeepEval: Open-Source LLM Evaluation Framework. Version 1.0 [software]. https://github.com/confident-ai/deepeval

2024

[6] [6]

Credo AI. 2024. Credo AI: AI Governance Platform. https://www.credo.ai

2024

[7] [7]

Vera Liao, Michael Muller, Mark O

Upol Ehsan, Q. Vera Liao, Michael Muller, Mark O. Riedl, and Justin D. Weisz. 2021. Expanding Explainability: Towards Social Transparency in AI Systems. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama, Japan, 1–19. doi:10.1145/3411764.3445188 Manuscript submitted to ACM LLM-FACETS: Privacy-Preserving LLM Evalu...

work page doi:10.1145/3411764.3445188 2021

[8] [8]

Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2024. RAGAS: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Association for Computational Linguistics, Malta, 150–158. doi:10.18653/v1/2024.eacl-demo.16

work page doi:10.18653/v1/2024.eacl-demo.16 2024

[9] [9]

European Parliament and Council of the European Union. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council on the Protection of Natural Persons with Regard to the Processing of Personal Data (General Data Protection Regulation). Official Journal of the European Union L 119/1. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CEL...

2016

[10] [10]

European Parliament and Council of the European Union. 2024. Regulation (EU) 2024/1689 of the European Parliament and of the Council – Artificial Intelligence Act. Official Journal of the European Union. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689

2024

[11] [11]

Fiddler AI. 2024. Fiddler AI: Model Performance Management Platform. https://www.fiddler.ai

2024

[12] [12]

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2023. Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text.Journal of Artificial Intelligence Research77 (2023), 103–166. doi:10.1613/jair.1.13715

work page doi:10.1613/jair.1.13715 2023

[13] [13]

Google DeepMind. 2024. Gemini: A Family of Highly Capable Multimodal Models. https://deepmind.google/technologies/gemini/

2024

[14] [14]

International Organization for Standardization. 2023. ISO/IEC 42001:2023 — Information Technology — Artificial Intelligence — Management System. https://www.iso.org/standard/81230.html

2023

[15] [15]

Minsuk Kahng, Ian Tenney, Mark Neumann, Jaime Wexler, Fernanda Viégas, and Martin Wattenberg. 2024. LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, Hawaii, USA, 1–7. doi:10.1145/3613905.3650755

work page doi:10.1145/3613905.3650755 2024

[16] [16]

Spencer Kelly. 2016. compromise: Modest Natural Language Processing for JavaScript. [software]. https://github.com/spencermountain/compromise

2016

[17] [17]

LangChain, Inc. 2023. LangSmith: Platform for Building Production-Grade LLM Applications. https://smith.langchain.com

2023

[18] [18]

Langfuse. 2024. Langfuse: Open Source LLM Engineering Platform. [software]. https://langfuse.com

2024

[19] [19]

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 6449–6464. doi:10.18653/v1/2023.emnlp-main.397

[20]

Q. Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: Informing Design Practices for Explainable AI User Experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, Hawaii, USA, 1–15. doi:10.1145/3313831.3376590

[21]

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013

[22]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Singapore, 2511–2522. doi:10.18653/v1/2023.emnlp-main.153

[23]

Microsoft Corporation. 2018. Presidio: Context-Aware, Pluggable and Customizable Data Protection and Anonymization Service for Text and Images. [software]. https://microsoft.github.io/presidio/

[24]

National Institute of Standards and Technology. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0). Technical Report NIST AI 100-1. National Institute of Standards and Technology, Gaithersburg, Maryland, USA. doi:10.6028/NIST.AI.100-1

[25]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. doi:10.3115/1073083.1073135

[26]

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation (WMT). Association for Computational Linguistics, Brussels, Belgium, 186–191. doi:10.18653/v1/W18-6319

[27]

Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: An Embeddable Analytical Database. In Proceedings of the 2019 ACM SIGMOD International Conference on Management of Data. ACM, Amsterdam, The Netherlands, 1981–1984. doi:10.1145/3299869.3320212

[28]

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, Melbourne, Australia, 784–789. doi:10.18653/v1/P18-2124

[29]

Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, and Julia Belikova. 2025. When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA. HuggingFace Datasets. https://huggingface.co/datasets/s-nlp/PsiloQA. doi:10.48550/arXiv.2510.04849

[30]

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics, M...

[31]

TruEra, Inc. 2024. TruLens: Evaluation and Tracking for LLM Experiments. [software]. https://github.com/truera/trulens

[32]

Igor Tufanov, Karen Hambardzumyan, Javier Ferrando, and Elena Voita. 2024. LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 3: System Demonstrations. Association for Computational Linguistics, Bangkok, Thailand, 29–41. d...

[33]

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv preprint arXiv:2404.18796. doi:10.48550/arXiv.2404.18796

[34]

Xenova. 2024. Transformers.js: State-of-the-Art Machine Learning for the Web. Version 2.17.2 [software]. https://github.com/xenova/transformers.js

[35]

Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do Large Language Models Know What They Don’t Know? In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 8653–8665. doi:10.18653/v1/2023.findings-acl.551

[36]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR). OpenReview.net, Addis Ababa, Ethiopia, 1–15. https://openreview.net/forum?id=SkeHuCVFDr

[37]

Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, and Frederic Sala. 2026. CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation. arXiv preprint arXiv:2603.00039. doi:10.48550/arXiv.2603.00039

[38]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., New Orleans, Louisiana, USA...
