Task-Level AI Readiness Assessment for Business Process Management: The T-IPO Model and LARA Matrix in Financial-Services IT Operations

Mingjun Li; Xiaojun Ye

arXiv: 2605.16297 · v1 · pith:7ZBTVJKInew · submitted 2026-04-16 · 💻 cs.CY · cs.AI · cs.HC · cs.SE

Pith reviewed 2026-05-21 01:21 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.HC · cs.SE

keywords task-level assessment · LLM agent readiness · business process management · T-IPO model · LARA rubric · compliance sensitivity · financial services IT · AI substitution

The pith

Task-level assessment with T-IPO and LARA predicts LLM agent performance more precisely than activity-level methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprises need to know which specific tasks inside larger workflows a large language model agent can handle reliably. Most existing frameworks evaluate at the activity level, which often mixes tasks of varying difficulty. This paper introduces the T-IPO representation and the LARA rubric to break processes down to the task level and score readiness across five dimensions, with extra weight on compliance sensitivity. Validation across 127 tasks shows high inter-rater reliability, and a pilot on 120 task instances shows auto-completion rates declining steadily as readiness drops from L1 to L3. The approach includes a method to update the scores as models improve.

Core claim

The T-IPO model represents each task as an eight-element tuple. The LARA rubric scores that task on five dimensions, weights compliance sensitivity at 1.5 times, and applies a floor rule so that a task with maximum compliance load cannot be classified below L3. The resulting four levels (L1 to L4) are claimed to predict agent performance better than activity-level methods.

What carries the argument

The LARA rubric: a five-dimension scoring system that puts 1.5 times weight on compliance sensitivity, applies a floor rule, and sorts tasks into four readiness levels for LLM agent substitution.
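
The scoring logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. Only the 1.5× weight on D4 (Compliance Sensitivity), the use of a weighted mean, the four levels, and the floor rule come from the paper. The 1-to-4 scoring scale, the unit weights on the other four dimensions, the level cut-offs, and the names `weighted_mean` and `lara_level` are assumptions introduced here.

```python
# Assumptions (not from the paper): a 1-4 scoring scale, unit weights on the
# four non-compliance dimensions, and the level cut-offs below.
MAX_SCORE = 4
WEIGHTS = {"D1": 1.0, "D2": 1.0, "D3": 1.0, "D4": 1.5, "D5": 1.0}  # D4 = Compliance Sensitivity
CUTOFFS = [(1.75, 1), (2.5, 2), (3.25, 3)]  # hypothetical upper bounds for L1, L2, L3


def weighted_mean(scores: dict[str, int]) -> float:
    """Weighted mean of the five dimension scores."""
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return total / sum(WEIGHTS.values())


def lara_level(scores: dict[str, int]) -> int:
    """Map five dimension scores to a readiness level 1..4 (L1 = most ready)."""
    mean = weighted_mean(scores)
    level = 4  # anything above the last cut-off is L4
    for upper_bound, candidate in CUTOFFS:
        if mean <= upper_bound:
            level = candidate
            break
    # Floor rule: a task with maximum compliance load is never more ready than L3.
    if scores["D4"] == MAX_SCORE:
        level = max(level, 3)
    return level


low_risk = {"D1": 1, "D2": 1, "D3": 1, "D4": 1, "D5": 1}
regulated = {"D1": 1, "D2": 1, "D3": 1, "D4": 4, "D5": 1}
print(lara_level(low_risk), lara_level(regulated))
```

In the second example the weighted mean alone would land in the hypothetical L2 band, and the floor rule lifts the task to L3. That is the behavior the paper describes, whatever cut-offs the authors actually use.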

If this is right

Classification into L1 to L4 levels allows targeted decisions on agent deployment for individual tasks.
Pilot data show auto-completion rates of 95 percent for L1 tasks, about 70 percent for L2, and about 40 percent for L3.
Exploratory factor analysis identifies cognitive-execution complexity and governance-compliance intensity as the primary underlying factors.
The LARA-TCA recalibration procedure enables ongoing adjustment to advancing LLM capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The T-IPO and LARA approach could be tested in non-financial domains to identify whether similar two-factor structures appear in other workflows.
Organizations might embed the rubric into existing process modeling software to score tasks automatically during design.
Focusing measurement efforts on cognitive complexity and compliance intensity could simplify future versions of the assessment.

Load-bearing premise

The 1.5 times weight for compliance sensitivity fixed via Delphi study and AHP, along with the floor rule and four-level classification, produces generalizable readiness scores that predict agent performance beyond the studied financial-services IT context and 127 tasks.

What would settle it

Applying the LARA rubric to tasks from a different industry outside financial services and checking whether the assigned levels match measured LLM agent auto-completion rates would test whether the predictive utility holds.

Figures

Figures reproduced from arXiv: 2605.16297 by Mingjun Li, Xiaojun Ye.

**Figure 1.** PARTIS hexagonal architecture: six dimensions with the two flows (solid = Execution Flow; dashed = Governance Flow). Execution Flow (solid arrows): Process decomposes into Activities (P→A), Activities are assigned to Roles (A→R), Roles execute Tasks (R→T). Each relationship is formalized as OCL constraints in the PARTIS metamodel (e.g., C1: every Process contains at least one Activity; C5: every Task has e…

**Figure 2.** End-to-end PARTIS pipeline on a concrete example from the CM (Configuration Management) domain. The CM process decomposes into the CM.1 Code Management activity, which further decomposes into three tasks via T-IPO. Each task receives five-dimension LARA scores and a resulting level. CM.1.1 (Merge Request Review) scores L2 because D4 (Compliance Sensitivity) = 3 elevates the weighted mean …

**Figure 3.** T-IPO eight-tuple: worked example for CM.1.2 "Code Static Scanning" (L1, Bloom Lv.2, Deterministic)

**Figure 4.** Structural mapping from T-IPO eight-tuple to four-section agent prompt architecture.

**Figure 5.** LARA distribution: (a) overall; (b) by domain with κ values. Key findings (…

**Figure 6.** Comparison: "System Architecture Design" receives L3 at activity level, masking three L1 tasks visible through T-IPO decomposition. A fair comparison must also consider cost: T-IPO decomposition required approximately 3 weeks of collaborative effort for 127 tasks, whereas activity-level assessment is near-instantaneous. The accuracy-per-effort trade-off favors T-IPO when the downstream cost of missed L1 op…

**Figure 7.** Validation extension roadmap: three parallel streams targeting structural validity (CFA), criterion validity (L3/L4), and decomposition stability (multi-domain T-IPO reproducibility). 7 Discussion 7.1 Contributions The paper's contributions are organized in three layers. At the framing level, PARTIS provides a six-dimensional analytical architecture with dual execution–governance cycles, positioning Task a…

Original abstract

Which tasks inside an enterprise workflow can a large-language-model agent reliably handle, and under what conditions? Most business process modeling frameworks still answer this at the activity level, even though a single activity can bundle work of radically different difficulty. This paper takes the analysis a step smaller. We describe two design artifacts developed in a financial-services IT setting: T-IPO, which represents each task as an eight-element tuple, and LARA (LLM Agent Readiness Assessment), a five-dimension rubric that scores a task's readiness for agent substitution. Compliance Sensitivity carries $1.5\times$ weight, a value we fixed through a three-round Delphi study and cross-checked with AHP. The rubric produces four levels, L1 to L4, and applies a floor rule so that a task with maximum compliance load cannot be classified below L3 no matter what the other scores say. Both artifacts sit inside a larger methodology (PARTIS) that we map onto BWW ontology in Section 3. We evaluate the instruments across 127 tasks. Inter-rater agreement reaches Fleiss' $\kappa = 0.80$; a replication at three further institutions returns $\kappa = 0.73$. A controlled comparison against activity-level assessment suggests, though does not prove, an improvement in predictive utility at the task level. Pilot deployment of 120 task instances confirms that auto-completion decays monotonically from $95\%$ at L1 through about $70\%$ at L2 to about $40\%$ at L3. Exploratory factor analysis points to a two-factor structure: task readiness seems to be determined jointly by cognitive-execution complexity and governance-compliance intensity. We close with a recalibration procedure (LARA-TCA) so the rubric can keep pace with evolving LLM capabilities.
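
For readers unfamiliar with the agreement statistic quoted in the abstract, the sketch below computes Fleiss' κ from a table of rating counts. It follows the standard textbook formula and is not taken from the paper. The function name `fleiss_kappa` and the toy table are illustrative assumptions, and the paper's own rating data are not reproduced here.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who put item i into category j. Every item needs the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must be rated by the same number (>= 2) of raters")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy table: 4 tasks, 3 raters, categories L1..L4 (hypothetical data).
toy = [
    [3, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 0, 3],
]
print(round(fleiss_kappa(toy), 3))
```

A value of 1 means the raters always agree, and 0 means agreement no better than chance given the category frequencies.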

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Task-level tools for LLM agent readiness in financial IT show useful pilot patterns but lean on expert-fixed weights without enough robustness checks.

The main takeaway is that this paper gives a practical way to score individual tasks for LLM agent substitution in regulated financial-services IT, using the T-IPO eight-element tuple and the LARA five-dimension rubric with its compliance emphasis and floor rule. The pilot data on 120 instances line up with the claim that auto-completion rates fall as tasks move from L1 to L3, which is the kind of concrete signal practitioners can use right away. The authors also report inter-rater agreement of Fleiss' κ = 0.80, a replication at three further institutions (κ = 0.73), and a factor analysis that splits the drivers into cognitive-execution complexity and governance-compliance intensity. That combination of new artifacts and some real-world numbers is the part worth paying attention to. The mapping to BWW ontology and the built-in recalibration step for future model changes are small but sensible additions that keep the framework from going stale.

The soft spot sits with the 1.5× compliance weight and the floor rule. Both were set once through Delphi rounds and AHP inside the original context, and the write-up does not show sensitivity runs or re-derivation against the pilot completion data. If shifting those choices changes which tasks land in which level, the reported monotonic decay and the edge over activity-level assessment start to look more like artifacts of the chosen parameters. The controlled comparison is mentioned, but the details on design, exclusions, and tests stay thin, so the predictive-utility advantage is still more suggestive than locked down.

This is for BPM people and IT ops teams in finance or similar regulated fields who need to decide which tasks to hand to agents without breaking compliance. A reader who wants specific scoring instruments and initial validation numbers will get something usable, even if they plan to tweak the weights for their own data.

I would send it for peer review. The artifacts are new enough and the pilot observations are grounded enough that a referee could help tighten the validation side without starting from scratch.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the T-IPO model (an eight-element tuple for task representation) and LARA rubric (five-dimension assessment with 1.5× weighting on Compliance Sensitivity and a floor rule forcing high-compliance tasks to L3 or above) for evaluating LLM agent readiness at the task level in financial-services IT operations. It reports evaluation on 127 tasks with Fleiss' κ=0.80 inter-rater agreement and κ=0.73 in multi-institution replication, a controlled comparison suggesting improved predictive utility versus activity-level assessment, pilot results on 120 instances with monotonic auto-completion decay (95% at L1, ~70% at L2, ~40% at L3), exploratory factor analysis indicating a two-factor structure (cognitive-execution complexity and governance-compliance intensity), and a proposed LARA-TCA recalibration procedure. The artifacts are situated within the PARTIS methodology mapped to BWW ontology in Section 3.

Significance. If the central claims hold, the work supplies a practical, finer-grained framework for identifying automatable tasks in enterprise BPM, moving beyond activity-level granularity with direct applicability to financial IT operations. Credit is given for the concrete reliability metrics (Fleiss' κ values), multi-institution replication, real pilot deployment data yielding falsifiable predictions, and the adaptive recalibration procedure, all of which strengthen empirical grounding and reproducibility.

major comments (2)

[LARA rubric definition] The 1.5× multiplier on Compliance Sensitivity and the floor rule (any maximum-compliance task forced to L3 or above) are fixed once via the three-round Delphi study and AHP cross-check, with no sensitivity analysis, alternative weightings, or re-derivation from the 120-instance pilot performance data described. This choice is load-bearing for the four-level classification and the reported monotonic decay in auto-completion rates, so the claimed task-level predictive-utility advantage risks being an artifact of these parameters rather than evidence for the underlying construct.
[Evaluation section (controlled comparison)] The abstract states that a controlled comparison against activity-level assessment 'suggests, though does not prove, an improvement in predictive utility,' yet provides no details on the comparison design, data exclusion criteria, or statistical tests. This leaves the central claim under-supported and requires explicit reporting of how task-level scores were shown to outperform activity-level baselines on the 127 tasks.

minor comments (2)

[Abstract] The auto-completion percentages for L2 and L3 are reported as 'about 70%' and 'about 40%'; supplying exact figures, sample sizes per level, or confidence intervals from the 120 instances would improve precision.
[Section 3 (PARTIS to BWW ontology mapping)] The correspondence between the T-IPO eight-element tuple, LARA dimensions, and BWW ontology elements would be clearer with an explicit table or diagram.
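
The first minor comment asks for confidence intervals on the per-level auto-completion rates. One common choice for a proportion is the Wilson score interval, sketched below. The per-level counts are hypothetical: the source reports 120 pilot instances in total but not how they split across levels, so an even split of 40 per level is assumed purely for illustration, and `wilson_interval` is a name introduced here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical split of the 120 pilot instances: 40 per level, with success
# counts chosen to match the reported 95%, ~70%, and ~40% rates.
pilot = {"L1": (38, 40), "L2": (28, 40), "L3": (16, 40)}
for level, (done, total) in pilot.items():
    low, high = wilson_interval(done, total)
    print(f"{level}: {done / total:.0%} auto-completed, 95% CI [{low:.2f}, {high:.2f}]")
```

With samples this small the intervals are wide, which is why the referee asks for the actual per-level sample sizes.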

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments point by point below, indicating the revisions we plan to make to strengthen the work.

Referee: [LARA rubric definition] The 1.5× multiplier on Compliance Sensitivity and the floor rule (any maximum-compliance task forced to L3 or above) are fixed once via the three-round Delphi study and AHP cross-check, with no sensitivity analysis, alternative weightings, or re-derivation from the 120-instance pilot performance data described. This choice is load-bearing for the four-level classification and the reported monotonic decay in auto-completion rates, so the claimed task-level predictive-utility advantage risks being an artifact of these parameters rather than evidence for the underlying construct.

Authors: We agree that the fixed weighting and floor rule, derived from the Delphi study and AHP, would benefit from additional validation. While these parameters were established through expert consensus to reflect domain priorities in financial services, the absence of sensitivity analysis in the manuscript is a limitation. In the revision, we will add a sensitivity analysis section that varies the Compliance Sensitivity multiplier across a range (e.g., 1.0×, 1.5×, 2.0×) and assesses the stability of the L1-L3 classifications and the monotonicity of pilot auto-completion rates. We will also explore alternative weightings and discuss why re-derivation from pilot data alone may not be appropriate given the expert-driven nature of the rubric. This will help confirm that the observed predictive advantages are robust.

Revision: yes

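The sensitivity analysis promised in this response could start from a sweep like the one below, which re-scores a set of tasks under each multiplier and lists the tasks whose level differs from the 1.5× baseline. It is a sketch under stated assumptions, not the authors' procedure: the 1-to-4 scale, unit weights on the other dimensions, cut-offs, task names, and scores are all hypothetical, and it covers classification stability only, not the monotonicity check.

```python
MAX_SCORE = 4                                # assumed scoring scale: 1..4
CUTOFFS = [(1.75, 1), (2.5, 2), (3.25, 3)]   # hypothetical upper bounds for L1..L3


def level(scores: tuple[int, int, int, int, int], compliance_weight: float) -> int:
    """Readiness level for scores (D1..D5), with D4 = Compliance Sensitivity."""
    weights = [1.0, 1.0, 1.0, compliance_weight, 1.0]
    mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    lvl = next((candidate for upper, candidate in CUTOFFS if mean <= upper), 4)
    if scores[3] == MAX_SCORE:               # floor rule for maximum compliance load
        lvl = max(lvl, 3)
    return lvl


def sweep(tasks: dict[str, tuple[int, int, int, int, int]],
          multipliers: tuple[float, ...] = (1.0, 1.5, 2.0),
          baseline: float = 1.5) -> dict[float, list[str]]:
    """For each multiplier, list the tasks whose level differs from the baseline."""
    base = {name: level(scores, baseline) for name, scores in tasks.items()}
    return {
        m: [name for name, scores in tasks.items() if level(scores, m) != base[name]]
        for m in multipliers
    }


# Hypothetical tasks and scores, for illustration only.
tasks = {
    "task_a": (1, 1, 1, 1, 1),
    "task_b": (1, 2, 1, 3, 1),
    "task_c": (3, 3, 3, 4, 3),
}
print(sweep(tasks))
```

Tasks that sit near a cut-off are the ones that move, so the share of reclassified tasks at each multiplier gives a quick read on how much the four-level result depends on the 1.5× choice.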
Referee: [Evaluation section (controlled comparison)] The abstract states that a controlled comparison against activity-level assessment 'suggests, though does not prove, an improvement in predictive utility,' yet provides no details on the comparison design, data exclusion criteria, or statistical tests. This leaves the central claim under-supported and requires explicit reporting of how task-level scores were shown to outperform activity-level baselines on the 127 tasks.

Authors: The referee is correct that the manuscript does not provide sufficient details on the controlled comparison, which weakens support for the claim of improved predictive utility. We will revise the Evaluation section to explicitly describe the comparison design, including how activity-level assessments were derived from the task-level data, the criteria for including or excluding tasks from the 127-task set, and the statistical methods employed (such as correlation analysis or predictive accuracy metrics with appropriate tests). We will also clarify the limitations of the comparison as noted in the abstract. These additions will make the evidence for task-level advantages more transparent and reproducible.

Revision: yes
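
To make concrete what "deriving activity-level assessments from task-level data" might involve, the sketch below labels an activity with the level of its least-ready task and counts how many more-ready tasks that single label hides. The aggregation rule is an assumption made here for illustration. The source does not state how the authors derived their activity-level baseline, and the example decomposition is hypothetical.

```python
def activity_level(task_levels: list[int]) -> int:
    """Hypothetical aggregation: an activity is only as ready as its least-ready task."""
    if not task_levels:
        raise ValueError("an activity needs at least one task")
    return max(task_levels)  # higher level number = less ready


def masked_tasks(task_levels: list[int]) -> int:
    """Count tasks whose own level is more ready than the activity-level label."""
    label = activity_level(task_levels)
    return sum(1 for lvl in task_levels if lvl < label)


# Hypothetical decomposition: three L1 tasks bundled with one L3 task.
example = [1, 1, 1, 3]
print(f"activity label: L{activity_level(example)}, masked tasks: {masked_tasks(example)}")
```

Here the activity is labeled L3 and the three L1 tasks disappear from view, the same kind of masking the Figure 6 caption describes for "System Architecture Design". Other aggregation rules, such as a mean or a majority vote, would mask different tasks, which is why the referee asks for the comparison design to be spelled out.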

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained

The LARA rubric weights (including the 1.5× Compliance Sensitivity multiplier) and floor rule are fixed via an independent three-round Delphi study plus AHP cross-check, not derived from or fitted to the pilot performance data. The four-level classification is then applied to 127 tasks for inter-rater testing (Fleiss' κ = 0.80) and to 120 pilot instances, where observed auto-completion rates are reported to decay monotonically. Replication at three further institutions supplies additional external grounding. No equation, level assignment, or predictive claim reduces by construction to the expert inputs; the monotonic decay is an empirical observation tested against agent behavior rather than a tautology. The central task-level predictive-utility claim therefore rests on separate validation steps and does not exhibit self-definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

Central claim rests on expert-derived weighting and new conceptual models; limited independent evidence beyond internal evaluations on 127 tasks.

free parameters (1)

Compliance Sensitivity weight = 1.5: fixed at 1.5× through a three-round Delphi study, cross-checked with AHP.

axioms (1)

Domain assumption: the PARTIS methodology maps onto BWW ontology. Invoked in Section 3 to situate the T-IPO and LARA artifacts.

invented entities (2)

T-IPO eight-element tuple (no independent evidence). Purpose: represents each task for readiness analysis. New design artifact introduced for granular task modeling.
LARA five-dimension rubric (no independent evidence). Purpose: scores LLM agent substitution readiness with levels L1-L4. New assessment instrument with weighted compliance and floor rule.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: LARA five dimensions with scoring anchors and weights; D4 Compliance Sensitivity carries 1.5× weight fixed via three-round Delphi + AHP

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: PARTIS hexagonal architecture with Execution Flow (P→A→R→T) and Governance Flow (T→I→S→P); BWW ontological mapping

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

