pith. machine review for the scientific record.

arxiv: 2604.23855 · v1 · submitted 2026-04-26 · 💻 cs.CL · cs.SE

Recognition: unknown

Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows

Anatolii Potapov, Dmitry Bitman, Elisei Rykov, Nikita Borovkov, Nikita Surnachev, Olga Tsymboi, Sergei Filimonov

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:03 UTC · model grok-4.3

classification 💻 cs.CL cs.SE
keywords LLM automation · customer support workflows · copilot feedback · selective autonomy · BPM platforms · abstention critic · UI interaction traces · enterprise deployment

The pith

A deployed system learns selective LLM autonomy for customer support by training on operator accept-or-correct feedback, automating 45% of sessions and cutting handling time 39% with no quality loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how an LLM system can be trained to automate end-to-end customer support tasks inside an enterprise BPM platform by using the supervision already created when operators accept or correct suggestions. It builds a next-action policy from UI traces and a critic from the same feedback to decide when to act autonomously versus defer to a human. High-confidence steps run in the background while the system pauses and resumes from any operator correction. This design lets one person oversee several sessions at once and reaches usable selective automation for a new process in about two weeks. Production results show 45% of sessions handled without human input and a 39% drop in average handling time while quality metrics stay flat.

Core claim

By collecting structured per-case UI interaction traces together with low-overhead copilot feedback, the system trains both a policy that predicts the next UI action and a critic that calibrates when to abstain. Only high-confidence actions execute in the background on the schema-driven view of the BPM interface; uncertain steps are handed to the operator and the session resumes from the corrected state. Safe fallbacks and monitoring keep the process recoverable. In live deployment this selective autonomy automated 45% of sessions, reduced average handling time by 39%, and preserved support quality levels.

What carries the argument

The staged deployment pipeline: a next-UI-action policy trained from interaction traces, plus an abstention critic learned from accept-or-correct copilot feedback, so that only steps the critic trusts run autonomously.
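In code, the mechanism reduces to a small gating loop. The sketch below is illustrative only: the policy, critic, and session interfaces and the 0.9 threshold are hypothetical stand-ins, since the paper publishes neither its APIs nor its threshold value.

```python
THRESHOLD = 0.9  # hypothetical value; the paper tunes this on held-out copilot feedback

def run_session(session, policy, critic, threshold=THRESHOLD):
    """Run high-confidence steps in the background; defer uncertain ones."""
    while not session.done():
        state = session.ui_state()                 # schema-driven view of the BPM interface
        action = policy.propose(state)             # next-UI-action prediction
        confidence = critic.score(state, action)   # critic trained on accept/correct labels
        if confidence >= threshold:
            session.execute(action)                # confident: act autonomously
        else:
            corrected = session.defer_to_operator(action)  # operator accepts or corrects
            session.execute(corrected)             # resume from the operator-corrected state
            # the (action, corrected) pair becomes fresh copilot feedback for retraining
```

Because each loop interrupts only when the critic abstains, one operator can watch several such loops concurrently.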

If this is right

  • One operator can supervise multiple concurrent sessions and is interrupted only when the critic withholds action.
  • Selective automation for a new process is reached within two weeks using only the supervision already generated during normal work.
  • Production monitoring plus safe fallbacks allow the system to resume cleanly from operator-corrected states.
  • Support quality metrics remain unchanged even as automation and throughput increase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same feedback-driven abstention mechanism could be applied to other schema-driven enterprise UIs such as order processing or compliance checks.
  • If feedback consistency improves with operator experience, the automation rate could rise over time without new model training.
  • The approach may be especially useful in regulated domains where full end-to-end autonomy is disallowed but partial background assistance is acceptable.

Load-bearing premise

Copilot feedback from operators is consistent and unbiased enough to train both the action policy and the abstention critic without introducing unrecoverable errors.

What would settle it

Deploy the same pipeline in an environment where operators are instructed to give deliberately noisy or biased corrections and measure whether the 45% automation rate and 39% time reduction still appear.
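A back-of-envelope version of that stress test can be scripted before any live deployment: corrupt a fraction of the accept/correct labels, retrain the critic, and watch the automation rate at a fixed threshold. Everything below is synthetic and illustrative; the features, model class, and threshold are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for (state, action) features and accept/correct labels.
X = rng.normal(size=(5000, 16))
accepted = (X[:, 0] + 0.5 * rng.normal(size=5000) > 0).astype(int)
X_train, y_train, X_test = X[:4000], accepted[:4000], X[4000:]

def train_critic(X, y, noise_rate):
    y = y.copy()
    flip = rng.random(len(y)) < noise_rate  # deliberately corrupt copilot feedback
    y[flip] = 1 - y[flip]
    return LogisticRegression(max_iter=1000).fit(X, y)

for noise in (0.0, 0.1, 0.3):
    critic = train_critic(X_train, y_train, noise)
    conf = critic.predict_proba(X_test)[:, 1]
    print(f"noise={noise:.1f}  automation_rate={(conf >= 0.9).mean():.2f}")
```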

Figures

Figures reproduced from arXiv: 2604.23855 by Anatolii Potapov, Dmitry Bitman, Elisei Rykov, Nikita Borovkov, Nikita Surnachev, Olga Tsymboi, Sergei Filimonov.

Figure 1: Staged deployment for selective automation. Logging collects operator trajectories with UI context to train an initial policy. Copilot deploys the policy to propose actions with human oversight on critical steps, producing accept or override feedback. Selective automation adds a critic that scores policy proposals and executes only critic-approved critical actions. This stage runs in calibration with manda…
Figure 2: High-level breakdown of rejected suggestions.
Figure 3: Data pipeline. We first log each BPM step as an HTML snapshot that includes both the customer…
Figure 4: Prompt template used for the prompting baseline.
original abstract

We present a deployed system that automates end-to-end customer support workflows inside an enterprise Business Process Management (BPM) platform. The approach is scalable in production and reaches selective automation within two weeks for a new process, leveraging supervision already generated at scale: structured per-case UI interaction traces and low-overhead copilot feedback, where operators either accept a suggestion or provide a correction. A staged deployment pipeline trains a next UI action policy, learns a critic from copilot feedback to calibrate abstention, and executes only high-confidence steps in the background while deferring uncertain decisions to operators and resuming from the updated UI state. This setup lets one operator supervise multiple concurrent sessions and be interrupted only when the system is uncertain. The system operates on a schema-driven view of the BPM interface and includes monitoring and safe fallbacks for production. In production, it automated 45% of sessions and reduced average handling time by 39% without degrading support quality level.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a deployed system for end-to-end automation of enterprise customer support workflows in a BPM platform. It uses structured per-case UI interaction traces and low-overhead copilot feedback (operator accept/correct signals) to train a next-action policy and an abstention critic, then executes only high-confidence steps autonomously while deferring uncertain decisions to operators and resuming from the updated UI state. The system incorporates schema-driven UI views, monitoring, and safe fallbacks. In production it automates 45% of sessions, reduces average handling time by 39%, and maintains support quality, with the ability to reach selective automation for new processes within two weeks.

Significance. If the reported production outcomes prove robust, the work offers a concrete demonstration of scalable, feedback-driven selective autonomy for LLMs in real enterprise settings. It shows how existing operator supervision can be repurposed to train both policy and critic, enabling one operator to oversee multiple sessions with minimal interruption, which has clear practical value for customer support and similar BPM domains.

major comments (2)
  1. [Abstract] The central empirical claims (45% automation rate, 39% AHT reduction, no quality degradation) are stated without any accompanying information on training data volume, model architectures, confidence calibration procedure, statistical significance testing, sample sizes, or controls for confounders such as operator learning curves or concurrent process changes. These omissions directly affect the ability to evaluate whether the results are reproducible or load-bearing for the selective-automation thesis.
  2. [Abstract] The manuscript asserts that copilot feedback is sufficient to train both the action policy and the abstention critic, and that schema-driven UI views plus safe fallbacks ensure safe resumption from operator-corrected states. However, no quantitative analysis of feedback consistency, bias (e.g., time-pressure effects or skill variation), or bounds on downstream error recovery after corrections is provided. This assumption is load-bearing for the reported automation fraction and quality invariance.
minor comments (1)
  1. [Abstract] The abstract would benefit from a single sentence on the scale of the production deployment (number of processes, operators, or sessions) to contextualize the 45% and 39% figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and for highlighting areas where the abstract and supporting analyses can be strengthened. We address each major comment below and will revise the manuscript to improve clarity and completeness while preserving the core contributions.

point-by-point responses
  1. Referee: [Abstract] The central empirical claims (45% automation rate, 39% AHT reduction, no quality degradation) are stated without any accompanying information on training data volume, model architectures, confidence calibration procedure, statistical significance testing, sample sizes, or controls for confounders such as operator learning curves or concurrent process changes. These omissions directly affect the ability to evaluate whether the results are reproducible or load-bearing for the selective-automation thesis.

    Authors: We agree that the abstract would be strengthened by including concise references to these details. In the revised version we will expand the abstract to note the scale of training data (production interaction traces from thousands of sessions), the policy and critic architectures (transformer-based models trained on UI traces), the calibration procedure (threshold tuning on held-out feedback), and the evaluation approach (pre/post deployment comparison with operator-stratified controls and no concurrent process changes during the measurement window). These elements are already described in Sections 3 and 4 of the manuscript; the revision will simply surface the key facts in the abstract for immediate visibility. revision: yes
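A concrete reading of "threshold tuning on held-out feedback": sweep the critic's confidence over held-out accept/correct labels and keep the lowest threshold at which autonomously executed actions would have been operator-accepted at a target rate. The precision-constrained rule below is one plausible reconstruction, not a procedure the paper states.

```python
import numpy as np

def tune_threshold(conf, accepted, min_precision=0.98):
    """conf: critic confidences on held-out steps (np.ndarray);
    accepted: 1 if the operator accepted the suggestion, 0 if corrected.
    Returns the lowest threshold meeting min_precision, or None."""
    order = np.argsort(-conf)               # consider candidate cutoffs from high to low
    conf, accepted = conf[order], accepted[order]
    precision = np.cumsum(accepted) / np.arange(1, len(conf) + 1)
    ok = np.nonzero(precision >= min_precision)[0]
    if ok.size == 0:
        return None                         # no cutoff is safe enough: always defer
    return conf[ok[-1]]                     # largest qualifying set = lowest threshold
```

Lowering min_precision trades quality risk for a higher automation rate; the reported 45% figure implicitly fixes one point on that curve.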

  2. Referee: [Abstract] The manuscript asserts that copilot feedback is sufficient to train both the action policy and the abstention critic, and that schema-driven UI views plus safe fallbacks ensure safe resumption from operator-corrected states. However, no quantitative analysis of feedback consistency, bias (e.g., time-pressure effects or skill variation), or bounds on downstream error recovery after corrections is provided. This assumption is load-bearing for the reported automation fraction and quality invariance.

    Authors: We acknowledge that a dedicated quantitative breakdown of feedback consistency and bias sources would further support the claims. The current manuscript relies on the observed production outcomes (sustained quality metrics and rapid onboarding of new processes) as evidence that the feedback is sufficient and that the critic plus schema-driven resumption mechanism works in practice. In revision we will add a short subsection summarizing observed operator acceptance rates, inter-operator agreement on corrections, and resumption success rates from the monitoring logs. We do not have separate instrumentation for time-pressure or skill-variation bias during the original deployment, so we will note this as a limitation and describe the mitigation strategies (critic abstention and safe fallbacks) that were used instead. revision: partial
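If the promised inter-operator agreement numbers do appear, Cohen's kappa over suggestions reviewed by more than one operator is the natural statistic. A toy computation follows, with a pairing scheme the paper does not describe:

```python
from sklearn.metrics import cohen_kappa_score

# 1 = accepted the suggestion, 0 = corrected it, over the same shared cases
operator_a = [1, 1, 0, 1, 0, 1, 1, 0]
operator_b = [1, 0, 0, 1, 0, 1, 1, 1]

print(f"inter-operator agreement (Cohen's kappa): "
      f"{cohen_kappa_score(operator_a, operator_b):.2f}")
```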

Circularity Check

0 steps flagged

No circularity: purely empirical production deployment with measured outcomes

full rationale

The paper presents a deployed system for selective LLM automation in customer support, trained on copilot feedback and evaluated via direct production metrics (45% automation rate, 39% AHT reduction). No equations, derivations, fitted parameters, or mathematical predictions are described. The reported results are obtained from live measurements rather than any internal model output that could reduce to its training inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claims rest on external empirical observation, making the derivation chain self-contained with no reduction to inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that logged UI traces and binary accept/correct feedback are sufficient to train both a reliable next-action policy and a well-calibrated abstention critic. No free parameters are explicitly named in the abstract, but implicit ones include the confidence threshold for automation and any reward shaping used during policy training. No new entities are postulated.

free parameters (1)
  • abstention confidence threshold
    The point at which the critic decides to execute an action automatically versus deferring to the operator; its value is not stated and must be tuned on production data.
axioms (2)
  • domain assumption Copilot feedback is an unbiased and sufficiently dense signal for both policy improvement and critic calibration
    Invoked when the abstract states that the critic is learned directly from accept/correct labels to calibrate abstention.
  • domain assumption The schema-driven UI view plus safe fallbacks allow reliable resumption after operator intervention
    Required for the claim that the system can run in the background without degrading quality.

pith-pipeline@v0.9.0 · 5487 in / 1611 out tokens · 29671 ms · 2026-05-08T06:03:05.125178+00:00 · methodology

