Recognition: unknown
End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians
Pith reviewed 2026-05-07 10:26 UTC · model grok-4.3
The pith
A governance framework using clinician rubrics, live feedback, and controlled experiments makes continuous oversight of clinical AI agents achievable and effective.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate an integrated governance pipeline for an EHR-embedded agent that converts ambient audio into structured chart updates. Twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven versions underwent controlled experiments that raised median quality scores from 84 percent to 95 percent. Analysis of 107 live feedback entries over three months showed error reports falling from 79 percent to 30 percent and positive observations rising to 45 percent as engineering fixes addressed issues. Median processing time per audio segment reached 8.1 seconds with a 99.6 percent effective completion rate after retry mechanisms handled transient errors. The work concludes that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.
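To make the retry mechanism concrete, here is a minimal sketch of how transient model errors might be absorbed with bounded retries and exponential backoff. It is illustrative only, not the authors' implementation; `process_segment`, `TransientModelError`, and the backoff parameters are hypothetical stand-ins for the deployed pipeline.

```python
import random
import time


class TransientModelError(Exception):
    """Hypothetical stand-in for a retryable failure from the model API."""


def process_segment(audio_segment):
    """Placeholder for the real audio-to-chart-update call."""
    raise NotImplementedError


def process_with_retries(audio_segment, max_attempts=3, base_delay=1.0):
    """Retry transient failures with exponential backoff and jitter.

    A segment counts against the effective completion rate only when every
    attempt fails; successful retries are absorbed silently.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return process_segment(audio_segment)
        except TransientModelError:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)
```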
What carries the argument
The end-to-end governance framework that integrates rubric validation, live deployment feedback, technical performance monitoring, cost tracking, and controlled experimentation to gate system changes.
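As an illustration of how controlled experimentation might gate a release on rubric results, consider the sketch below. The 0.90 threshold, the data shape, and the promotion rule are assumptions for demonstration; only the idea of blocking deployment on rubric scores comes from the paper.

```python
from statistics import median


def gate_release(candidate_scores, baseline_scores, threshold=0.90):
    """Decide whether a candidate version may be promoted to full deployment.

    candidate_scores / baseline_scores: rubric scores in [0, 1] for the same
    cases, one per case. The candidate is promoted only if its median meets
    the threshold and does not regress against the deployed version.
    """
    cand_med = median(candidate_scores)
    base_med = median(baseline_scores)
    promoted = cand_med >= threshold and cand_med >= base_med
    return {
        "candidate_median": cand_med,
        "baseline_median": base_med,
        "promoted": promoted,
    }


# Example: a candidate with a 0.95 median against a 0.84 baseline passes the gate.
print(gate_release([0.93, 0.95, 0.97], [0.80, 0.84, 0.88]))
```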
If this is right
- Quality scores for the AI agent rise systematically when governance data drives targeted engineering interventions.
- The share of positive feedback increases and error reports decrease as specific failures are identified and resolved over months of use.
- High completion rates and acceptable processing speeds can be sustained by adding retry logic for transient model errors.
- New versions of the agent can be tested in controlled experiments and blocked from full deployment if they do not meet rubric thresholds.
- Cost tracking alongside quality metrics supports decisions about whether to maintain or scale the deployed agent.
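On the cost-tracking point above, here is a minimal sketch of how per-case cost might be summarised alongside rubric quality to inform a maintain-or-scale decision. The field names, budget, and quality threshold are hypothetical, not values from the paper.

```python
from statistics import median


def scale_decision(cases, quality_threshold=0.90, cost_budget_per_case=1.50):
    """Summarise cost against quality for a deployed agent version.

    cases: list of dicts with 'rubric_score' in [0, 1] and 'cost_usd' per case.
    Returns a summary an operations team could use to decide whether to
    maintain, scale, or roll back the agent.
    """
    med_quality = median(c["rubric_score"] for c in cases)
    mean_cost = sum(c["cost_usd"] for c in cases) / len(cases)
    return {
        "median_quality": med_quality,
        "mean_cost_usd": mean_cost,
        "within_budget": mean_cost <= cost_budget_per_case,
        "meets_quality": med_quality >= quality_threshold,
    }


print(scale_decision([
    {"rubric_score": 0.95, "cost_usd": 1.20},
    {"rubric_score": 0.92, "cost_usd": 1.35},
    {"rubric_score": 0.97, "cost_usd": 1.10},
]))
```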
Where Pith is reading between the lines
- The same combination of rubrics and live feedback could be applied to other clinical AI tools such as diagnostic suggestion systems to keep their outputs reliable after initial release.
- Hospitals might use the cost-tracking component to weigh quality gains against added compute expenses when choosing which AI agents to keep in production.
- A heavy share of error reports early on, followed by more positive notes later, suggests the framework can build clinician trust over time rather than only at launch.
- Testing whether the rubric approach transfers to different medical specialties or note formats would reveal how broadly the governance method applies.
Load-bearing premise
Clinician-authored rubrics and live feedback entries give an unbiased and complete picture of clinical note quality that holds beyond the twenty participants and the cases they reviewed.
What would settle it
A replication with a larger and more diverse set of clinicians and patient cases would settle it: if quality scores failed to rise, or live feedback stayed dominated by error reports despite the same engineering interventions, the governance approach would be shown not to produce reliable gains.
read the original abstract
Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes an end-to-end governance framework for Hyperscribe, an EHR-embedded AI agent that converts ambient audio into structured clinical notes. Twenty clinicians authored 1,646 validated rubrics across 823 cases to evaluate seven iterative versions of the system via controlled experiments, reporting median rubric scores rising from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports to 30% errors and 45% positive observations. Technical metrics include 8.1-second median latency and a 99.6% completion rate after retries. The authors conclude that continuous, multi-channel governance (rubric validation, live feedback, technical monitoring, cost tracking, and gated experimentation) is both achievable and effective for deployed clinical AI.
Significance. If the clinician rubrics and feedback reliably capture clinical note quality, the work offers a concrete, multi-channel example of post-deployment governance for AI in healthcare—an area where point-in-time evaluation is widely recognized as insufficient. The integration of rubric-based controlled experiments with live deployment logs and objective technical metrics provides a practical template that could inform regulatory and operational practices for clinical AI agents.
major comments (3)
- [Abstract and Results sections describing rubric and feedback analysis] The central claim that governance is 'effective' (Abstract and conclusion) rests on rubric-score improvements (84% to 95%) and feedback-composition shifts (79% errors to 45% positive). These metrics are generated entirely by the same 20 participating clinicians who authored the 1,646 rubrics and 107 feedback entries. No inter-rater reliability statistics, blinded independent review, or correlation with downstream clinical outcomes (e.g., note accuracy against gold-standard charts or patient outcomes) are reported, leaving open the possibility that observed gains reflect rater drift, selection effects, or expectation bias rather than genuine quality gains.
- [Results on rubric scores and live feedback] The manuscript reports directional improvements across seven versions but provides no statistical tests (e.g., Wilcoxon signed-rank or mixed-effects models accounting for clinician and case clustering) for the median score changes or feedback-composition shifts. Without these, it is unclear whether the reported gains exceed what would be expected from sampling variability or case-selection effects in the 823 cases. A minimal sketch of such a paired test appears after this list.
- [Methods describing rubric creation and case selection] The 823 cases and 20 clinicians are described as the source of all rubrics and feedback, yet no details are given on case-selection criteria, potential selection bias, or how representative the sample is of broader clinical workflows. This directly affects the generalizability of the claim that the governance framework is effective beyond this specific deployment.
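To illustrate the kind of paired, non-parametric test the second comment asks for, here is a minimal sketch using `scipy.stats.wilcoxon` on hypothetical per-case rubric scores for two versions. The numbers are placeholders, not data from the study, and the sketch ignores the clinician and case clustering the comment also raises.

```python
from scipy.stats import wilcoxon

# Hypothetical per-case rubric scores for the same cases under two versions.
scores_v1 = [0.80, 0.84, 0.82, 0.88, 0.79, 0.85, 0.83, 0.86]
scores_v7 = [0.93, 0.95, 0.94, 0.97, 0.92, 0.96, 0.95, 0.94]

# Paired, non-parametric test of whether the score shift exceeds sampling noise.
stat, p_value = wilcoxon(scores_v1, scores_v7)
print(f"Wilcoxon statistic={stat:.1f}, p={p_value:.4f}")
```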
minor comments (2)
- [Abstract and Methods] The abstract and main text use 'median scores' and 'feedback composition' without specifying the exact rubric scale (e.g., binary pass/fail per item or continuous) or how positive/negative feedback was categorized, which would aid reproducibility.
- [Results on technical performance] Technical metrics (latency, completion rate) are presented as objective but the paper does not discuss how retry mechanisms or transient model errors were logged or whether they affected clinician-perceived quality.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript on the end-to-end governance framework for Hyperscribe. The comments raise important points regarding the interpretation of our evaluation metrics, statistical rigor, and generalizability. We address each major comment below with clarifications drawn directly from the study design and indicate specific revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and Results sections describing rubric and feedback analysis] The central claim that governance is 'effective' (Abstract and conclusion) rests on rubric-score improvements (84% to 95%) and feedback-composition shifts (79% errors to 45% positive). These metrics are generated entirely by the same 20 participating clinicians who authored the 1,646 rubrics and 107 feedback entries. No inter-rater reliability statistics, blinded independent review, or correlation with downstream clinical outcomes (e.g., note accuracy against gold-standard charts or patient outcomes) are reported, leaving open the possibility that observed gains reflect rater drift, selection effects, or expectation bias rather than genuine quality gains.
Authors: The evaluation design intentionally centers on structured, clinician-authored rubrics as the primary channel for assessing note quality in a real-world deployment, consistent with the multi-channel governance framework we describe. The 20 clinicians are practicing experts whose assessments directly inform system iterations. We agree, however, that the absence of inter-rater reliability statistics, blinded external review, and linkage to downstream clinical outcomes leaves room for alternative explanations such as rater drift or bias. We will add a dedicated Limitations section that explicitly discusses these issues, qualifies the scope of our effectiveness claims to the observed operational improvements within this framework, and notes the value of future studies incorporating such validations. revision: partial
-
Referee: [Results on rubric scores and live feedback] The manuscript reports directional improvements across seven versions but provides no statistical tests (e.g., Wilcoxon signed-rank or mixed-effects models accounting for clinician and case clustering) for the median score changes or feedback-composition shifts. Without these, it is unclear whether the reported gains exceed what would be expected from sampling variability or case-selection effects in the 823 cases.
Authors: We accept that formal statistical testing would improve the presentation of the iterative improvements. The medians were chosen to summarize the progression across controlled experiments with the same clinicians evaluating successive versions. In the revised manuscript we will add appropriate non-parametric tests (e.g., Wilcoxon signed-rank for paired version comparisons) and note any clustering considerations, thereby clarifying whether the observed shifts exceed sampling variability. revision: yes
-
Referee: [Methods describing rubric creation and case selection] The 823 cases and 20 clinicians are described as the source of all rubrics and feedback, yet no details are given on case-selection criteria, potential selection bias, or how representative the sample is of broader clinical workflows. This directly affects the generalizability of the claim that the governance framework is effective beyond this specific deployment.
Authors: We will expand the Methods section to detail the case-selection process, clinician recruitment criteria, inclusion/exclusion rules, and any steps taken to align the sample with typical clinical workflows. This addition will better support readers in assessing generalizability while acknowledging the single-site, real-world deployment context of the study. revision: yes
- The study did not collect inter-rater reliability statistics, blinded independent reviews, or correlations with downstream clinical outcomes; these cannot be supplied without additional data collection.
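For context on what an inter-rater reliability check could involve, here is a minimal sketch computing Cohen's kappa on hypothetical pass/fail judgments from two clinicians rating the same rubric items. It uses `scikit-learn`, which the paper does not mention, and the data are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pass/fail judgments by two clinicians on the same rubric items.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Chance-corrected agreement between the two raters.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```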
Circularity Check
No circularity: empirical results independent of internal definitions
full rationale
The paper reports measured improvements in rubric scores (84% to 95%) and feedback composition shifts from live deployment logs and clinician-authored rubrics across 1,646 entries. No equations, fitted parameters, derivations, or self-citations are present that reduce any claim to its own inputs by construction. The governance framework is evaluated through external controlled experiments and feedback analysis rather than self-referential definitions or renamed known results. This is a standard empirical evaluation with no load-bearing self-citation chains or ansatz smuggling.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Uptake of generative AI integrated with electronic health records in US hospitals. JAMA Netw Open
Everson J, Nong P, Richwine C. Uptake of generative AI integrated with electronic health records in US hospitals. JAMA Netw Open. 2025;8(12):e2549463. doi:10.1001/jamanetworkopen.2025.49463
-
[2]
Physician survey on augmented intelligence
American Medical Association. Physician survey on augmented intelligence. Mar 2026. Available from www.ama-assn.org/system/files/physician-ai-sentiment-report.pdf
-
[3]
Monitoring performance of clinical artificial intelligence in health care: a scoping review
Andersen ES, Hardt J, Nielsen C, et al. Monitoring performance of clinical artificial intelligence in health care: a scoping review. JBI Evidence Synthesis. 2024;22(12):2362–2419. doi:10.11124/JBIES-24-00042
-
[4]
Hassan N, Nayak A, Goh E, et al. Addressing the challenge of governance, risk, and compliance in responsible AI. IEEE Engineering Management Review. 2024;52(3):90–100. doi:10.1109/EMR.2024.3407478
-
[5]
Evaluating clinical AI summaries with large language models as judges
Croxford E, Gao M, Pellegrini E, et al. Evaluating clinical AI summaries with large language models as judges. npj Digital Medicine. 2025;8:640. doi:10.1038/s41746-025-02005-2
-
[6]
Introducing HealthBench
Arora R, Chaurasia A, Pfeffer MA, et al. Introducing HealthBench. OpenAI; 2025. Available from: https://openai.com/index/healthbench/
-
[7]
Tam TYC, Sivarajkumar S, Kapoor S, et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine. 2024;7:258. doi:10.1038/s41746-024-01258-7
-
[8]
Wang W, et al. An evaluation framework for ambient digital scribing tools in clinical applications. npj Digital Medicine. 2025;8:358. doi:10.1038/s41746-025-01622-1
-
[9]
A general framework for governing marketed AI/ML medical devices. npj Digital Medicine
Babic B, et al. A general framework for governing marketed AI/ML medical devices. npj Digital Medicine. 2025;8:328. doi:10.1038/s41746-025-01717-9
-
[10]
Hyperscribe
Canvas Medical. Hyperscribe. 2025. Available from: https://canvasmedical.com/extensions/hyperscribe
-
[11]
canvas-hyperscribe
Canvas Medical. canvas-hyperscribe. GitHub; 2025. Available from: https://github.com/canvas-medical/canvas-hyperscribe
-
[12]
Canvas SDK Commands
Canvas Medical. Canvas SDK Commands. Available from: https://docs.canvasmedical.com/sdk/commands/
-
[13]
Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters
Shah A, Hines A, Downs A, Bajet D, et al. Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters. [Companion paper–preprint pending]
-
[14]
Expert-Annotated Clinical Encounters: A Benchmark Dataset for Evaluating AI-Generated Medical Documentation
Shah A, et al. Expert-Annotated Clinical Encounters: A Benchmark Dataset for Evaluating AI-Generated Medical Documentation. PhysioNet. [To be added upon release]
-
[15]
Expert-Annotated Clinical Encounters: Dataset Usage
Shah A, et al. Expert-Annotated Clinical Encounters: Dataset Usage. GitHub; 2026. Available from: https://github.com/canvas-medical/dataset-usage
-
[16]
Hussein R, Schein O, Bates DW, et al. Advancing healthcare AI governance through a comprehensive maturity model based on systematic review. npj Digital Medicine. 2026;9:236. doi:10.1038/s41746-026-02418-7
-
[17]
A practical framework for appropriate implementation and review of artificial intelligence (FAIR-AI) in healthcare
Wells BJ, Walji MF, Sittig DF, et al. A practical framework for appropriate implementation and review of artificial intelligence (FAIR-AI) in healthcare. npj Digital Medicine. 2025;8:514. doi:10.1038/s41746-025-01900-y
-
[18]
Afshar M, Dligach D, Churpek MM, et al. A novel playbook for pragmatic trial operations to monitor and evaluate ambient artificial intelligence in clinical practice. NEJM AI. 2025;2(9). doi:10.1056/AIdbp2401267
-
[19]
Randomized clinical trial of two ambient AI scribes for clinical documentation
Tierney AA, Sirek A, Ufere NN, et al. Randomized clinical trial of two ambient AI scribes for clinical documentation. NEJM AI. 2025. doi:10.1056/AIoa2501000
-
[20]
A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation
Asgari E, Fernandes M, Laranjo L, et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine. 2025;8:274. doi:10.1038/s41746-025-01670-7