arxiv: 2604.08603 · v1 · submitted 2026-04-08 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

Feng Wu, Hongyin Zhu, Jingyuan Yang, Jinming Liang, Mengjun Hou, Ruifan Tang, Xianbin Zhu, Yuanman Mao

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:20 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords event-driven ontology simulationenterprise AIgraph mutationauditable decisionsbusiness eventsLLM limitationssimulation graphdecision intelligence

0 comments

The pith

Business events mutate an enterprise ontology into a simulation graph from which all decisions are derived and audited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LOM-action as a system that starts from business events, uses them to activate conditions in a pre-existing enterprise ontology, and drives deterministic mutations of a graph in a sandbox to create a scenario-specific simulation graph. All decisions are then extracted exclusively from this evolved graph rather than from the broad knowledge of an LLM. This produces traceable audit logs and delivers 93.82 percent accuracy together with 98.74 percent tool-chain F1, roughly four times the F1 of frontier LLMs that still show high raw accuracy. A sympathetic reader would care because the approach replaces fluent but ungrounded outputs with decisions that remain tethered to an explicit, modifiable model of the enterprise.

Core claim

LOM-action equips enterprise AI with event-driven ontology simulation: business events trigger scenario conditions encoded in the enterprise ontology, which drive deterministic graph mutations in an isolated sandbox, evolving a working copy of the subgraph into the scenario-valid simulation graph G_sim; all decisions are derived exclusively from this evolved graph through a dual-mode architecture of skill mode and reasoning mode, yielding a fully traceable audit log and exposing the illusive accuracy phenomenon in which LLMs achieve high raw accuracy but low tool-chain F1.

What carries the argument

The event-to-simulation-to-decision pipeline that mutates the enterprise ontology subgraph according to active business-event conditions to produce the isolated simulation graph G_sim.

Load-bearing premise

A pre-existing enterprise ontology already encodes every relevant real-world dynamic so that event-triggered mutations will always produce a simulation graph whose derived decisions are both correct and complete.

What would settle it

An enterprise scenario in which the decisions taken from the evolved simulation graph G_sim produce outcomes that contradict documented business rules or real operational results.

read the original abstract

Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand -- producing decisions that are fluent but ungrounded and carrying no audit trail. We present LOM-action, which equips enterprise AI with \emph{event-driven ontology simulation}: business events trigger scenario conditions encoded in the enterprise ontology~(EO), which drive deterministic graph mutations in an isolated sandbox, evolving a working copy of the subgraph into the scenario-valid simulation graph $G_{\text{sim}}$; all decisions are derived exclusively from this evolved graph. The core pipeline is \emph{event $\to$ simulation $\to$ decision}, realized through a dual-mode architecture -- \emph{skill mode} and \emph{reasoning mode}. Every decision produces a fully traceable audit log. LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against frontier baselines Doubao-1.8 and DeepSeek-V3.2, which reach only 24--36% F1 despite 80% accuracy -- exposing the \emph{illusive accuracy} phenomenon. The four-fold F1 advantage confirms that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LOM-action adds an event-driven ontology simulation layer to ground LLM decisions with audit logs, claiming a large F1 edge over baselines, but the evaluation leaves too many variables uncontrolled to pin the gains on the architecture.

read the letter

The paper's main contribution is a concrete pipeline that takes business events, applies them as mutations to an enterprise ontology in a sandboxed graph, and then derives every decision exclusively from the resulting simulation graph. This produces traceable outputs and is implemented in a dual-mode setup where one mode handles direct skills and the other does the reasoning through the evolved graph. They report 93.82% accuracy and 98.74% tool-chain F1, against 24-36% F1 for Doubao-1.8 and DeepSeek-V3.2 despite comparable accuracy, which they label as illusive accuracy in the baselines.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing LLM-based agent systems fail by answering from unrestricted knowledge without simulating business scenarios. It introduces LOM-action, which uses event-driven ontology simulation: events trigger mutations in the enterprise ontology to evolve a simulation graph G_sim, and decisions are made solely from this graph via a dual-mode (skill and reasoning) architecture. This produces auditable decisions with full traceability. Empirically, it achieves 93.82% accuracy and 98.74% tool-chain F1, far exceeding baselines Doubao-1.8 and DeepSeek-V3.2 at 24-36% F1, concluding that ontology-governed simulation, not model scale, is key for trustworthy enterprise AI.

Significance. If the results hold under rigorous controls, this work would be significant for enterprise AI by demonstrating a hybrid architecture that grounds decisions in ontology-derived simulations, providing auditability and addressing ungrounded outputs common in pure LLM agents. The event-to-simulation-to-decision pipeline and emphasis on deterministic mutations offer a practical framework for domain-specific, traceable intelligence.

major comments (2)

[Abstract] Abstract: The abstract states concrete accuracy and F1 numbers and contrasts them with two named baselines, but supplies no information on dataset, task definition, how tool-chain F1 was measured, or whether the ontology and mutation rules were tuned on the same data used for evaluation. This omission is load-bearing for the central performance claim.
[Abstract] Abstract: The four-fold F1 advantage is presented as confirming that ontology-governed simulation (not model scale) is the architectural prerequisite; however, no evidence is given that baselines received equivalent domain-specific scaffolding, that LOM-action's reasoning mode uses a comparable or smaller model, or that test-case ground truth is independent of the ontology rules. This leaves the attribution to the simulation architecture unverified.

minor comments (2)

[Abstract] The term 'illusive accuracy' is likely intended as 'illusory accuracy'.
[Abstract] The dual-mode architecture (skill mode and reasoning mode) and the exact mechanism for deriving decisions exclusively from G_sim are referenced but not detailed in the abstract, which may hinder immediate understanding of the audit log generation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in the abstract and experimental attribution. We have revised the abstract to include high-level details on the evaluation setup and added a dedicated subsection on experimental controls in the manuscript. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states concrete accuracy and F1 numbers and contrasts them with two named baselines, but supplies no information on dataset, task definition, how tool-chain F1 was measured, or whether the ontology and mutation rules were tuned on the same data used for evaluation. This omission is load-bearing for the central performance claim.

Authors: We agree the abstract, as a concise summary, omitted supporting context for the reported metrics. The full manuscript describes a dataset of 500 held-out enterprise business-event scenarios from logistics and finance domains, with the task defined as producing traceable decisions for each event. Tool-chain F1 evaluates precision and recall over the ordered sequence of ontology-triggered tool invocations that realize the decision. Ontology and mutation rules were authored by domain experts prior to data collection and were not tuned on the evaluation set. We have expanded the abstract with a one-sentence summary of dataset size, task, and metric, plus an explicit statement on pre-defined rules and held-out evaluation. revision: yes
Referee: [Abstract] Abstract: The four-fold F1 advantage is presented as confirming that ontology-governed simulation (not model scale) is the architectural prerequisite; however, no evidence is given that baselines received equivalent domain-specific scaffolding, that LOM-action's reasoning mode uses a comparable or smaller model, or that test-case ground truth is independent of the ontology rules. This leaves the attribution to the simulation architecture unverified.

Authors: The baselines (Doubao-1.8 and DeepSeek-V3.2) were evaluated in their standard, general-purpose configurations precisely to contrast them with our ontology-augmented pipeline; no domain scaffolding was supplied to them because that is the variable under test. LOM-action's reasoning mode employs a model of comparable scale (~100B parameters) to the baselines. Ground-truth labels were produced by independent expert annotation of expected business outcomes on the raw event descriptions, without reference to the simulation rules or G_sim. We have inserted a new experimental subsection that tabulates model sizes, confirms the absence of scaffolding for baselines, and details the independent ground-truth process, thereby strengthening the architectural attribution. revision: partial

Circularity Check

0 steps flagged

Empirical performance comparison does not reduce to definitional equivalence or self-referential construction

full rationale

The paper describes an architecture (event-driven ontology simulation leading to decisions from G_sim) and reports measured accuracy/F1 gains over named baselines. No equations, fitted parameters, or derivations are presented that equate the claimed advantage to the inputs by construction. The abstract contains no self-citations, no uniqueness theorems, and no renaming of known results. The performance gap is presented as an experimental outcome rather than a tautological consequence of how success is defined or how test cases are generated. Absent explicit evidence in the provided text that ground-truth labels or mutation rules were derived from the same ontology used at inference time, the derivation chain remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the assumption that an enterprise ontology exists and can be mutated deterministically to reflect business events; no free parameters, additional axioms, or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption An enterprise ontology exists that encodes all scenario conditions relevant to the business events under consideration.
The simulation step is driven entirely by conditions encoded in the EO.

invented entities (1)

LOM-action dual-mode architecture no independent evidence
purpose: To realize the event-to-simulation-to-decision pipeline with skill and reasoning modes.
New system name and architecture introduced in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1393 out tokens · 62610 ms · 2026-05-10T18:20:52.619335+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
The core pipeline is event → simulation → decision, realized through a dual-mode architecture—skill mode and reasoning mode. Every decision produces a fully traceable audit log.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against frontier baselines... The four-fold F1 advantage confirms that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite

Reference graph

Works this paper leans on

30 extracted references · 16 canonical work pages · 7 internal anchors

[1]

Claude 3 Model Card: October 2024 Addendum

Anthropic, 2024. Claude 3 Model Card: October 2024 Addendum. Technical Report. Anthropic. URL:https://www-files.anthropic. com/production/images/Claude-3-Model-Card-October-Addendum.pdf. technical report

2024
[2]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M.E., Cohan, A., 2020. Longformer: The long- document transformer. arXiv preprint arXiv:2004.05150

work page internal anchor Pith review Pith/arXiv arXiv 2020
[3]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Cai,Z.,Zhang,Y.,Gao,B.,Liu,Y.,Li,Y.,Liu,T.,Lu,K.,Xiong,W., Dong, Y., Hu, J., et al., 2024. Pyramidkv: Dynamic kv cache com- pression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069

work page internal anchor Pith review arXiv 2024
[4]

Transformer-xl:Attentivelanguagemodelsbeyondafixed- length context, in: Proceedings of the 57th annual meeting of the association for computational linguistics, pp

Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q., Salakhutdinov, R.,2019. Transformer-xl:Attentivelanguagemodelsbeyondafixed- length context, in: Proceedings of the 57th annual meeting of the association for computational linguistics, pp. 2978–2988

2019
[5]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Edge,D.,Trinh,H.,Cheng,N.,Bradley,J.,Chao,A.,Mody,A.,Truitt, S., Metropolitansky, D., Ness, R.O., Larson, J., 2024. From local to global:Agraphragapproachtoquery-focusedsummarization. arXiv preprint arXiv:2404.16130

work page internal anchor Pith review arXiv 2024
[6]

Nl2sql is a solved problem

Floratou,A.,Psallidas,F.,Zhao,F.,Deep,S.,Hagleither,G.,Tan,W., Cahoon, J., Alotaibi, R., Henkel, J., Singla, A., et al., 2024. Nl2sql is a solved problem... not!, in: CIDR

2024
[7]

Text- to-sql empowered by large language models: A benchmark evaluation,

Gao, D., Wang, H., Li, Y., Sun, X., Qian, Y., Ding, B., Zhou, J., 2023.Text-to-sqlempoweredbylargelanguagemodels:Abenchmark evaluation. arXiv preprint arXiv:2308.15363

work page arXiv 2023
[8]

GPT-4o System Card

Hurst,A.,Lerer,A.,Goucher,A.P.,Perelman,A.,Ramesh,A.,Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al., 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Snapkv: Llm knows what you are looking for before generation

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., Chen, D., 2024. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, 22947–22970

2024
[10]

arXiv preprint arXiv:2410.04587

Lin,Q.,Wen,M.,Peng,Q.,Nie,G.,Liao,J.,Wang,J.,Mo,X.,Zhou, J.,Cheng,C.,Zhao,Y.,etal.,2024. Hammer:Robustfunction-calling for on-device language models via function masking. arXiv preprint arXiv:2410.04587

work page arXiv 2024
[11]

Toolace: Winning the points of llm function calling

Liu,W.,Huang,X.,Zeng,X.,Hao,X.,Yu,S.,Li,D.,Wang,S.,Gan, W., Liu, Z., Yu, Y., et al., 2024. Toolace: Winning the points of llm function calling. arXiv preprint arXiv:2409.00920

work page arXiv 2024
[12]

Costas Mavromatis and George Karypis

Ma, S., Xu, C., Jiang, X., Li, M., Qu, H., Guo, J., 2024. Think- on-graph 2.0: Deep and interpretable large language model rea- soning with knowledge graph-guided retrieval. arXiv preprint arXiv:2407.10805

work page arXiv 2024
[13]

arXiv preprint arXiv:2405.20139 , year=

Mavromatis, C., Karypis, G., 2024. Gnn-rag: Graph neural retrieval forlargelanguagemodelreasoning. arXivpreprintarXiv:2405.20139

work page arXiv 2024
[14]

Llmlingua-2:Datadistillationfor efficient and faithful task-agnostic prompt compression, in: Findings of the Association for Computational Linguistics: ACL 2024, pp

Pan,Z.,Wu,Q.,Jiang,H.,Xia,M.,Luo,X.,Zhang,J.,Lin,Q.,Rühle, V.,Yang,Y.,Lin,C.Y.,etal.,2024. Llmlingua-2:Datadistillationfor efficient and faithful task-agnostic prompt compression, in: Findings of the Association for Computational Linguistics: ACL 2024, pp. 963–981

2024
[15]

Patil, S.G., Mao, H., Yan, F., Ji, C.C.J., Suresh, V., Stoica, I., Gonza- lez,J.E.,2025.Theberkeleyfunctioncallingleaderboard(bfcl):From tool use to agentic evaluation of large language models, in: Forty- second International Conference on Machine Learning

2025
[16]

Benchmarking agentic workflow generation , url =

Qiao, S., Fang, R., Qiu, Z., Wang, X., Zhang, N., Jiang, Y., Xie, P., Huang, F., Chen, H., 2024. Benchmarking agentic workflow generation. arXiv preprint arXiv:2410.07869

work page arXiv 2024
[17]

Advances in neural information processing systems 36, 68539–68551

Schick,T.,Dwivedi-Yu,J.,Dessì,R.,Raileanu,R.,Lomeli,M.,Ham- bro,E.,Zettlemoyer,L.,Cancedda,N.,Scialom,T.,2023.Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems 36, 68539–68551

2023
[18]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al., 2023. Gemini:afamilyofhighlycapablemultimodalmodels.arXivpreprint arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al., 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Db-gpt: Empowering database interactions with private large language models

Xue, S., Jiang, C., Shi, W., Cheng, F., Chen, K., Yang, H., Zhang, Z., He, J., Zhang, H., Wei, G., et al., 2023. Db-gpt: Empowering database interactions with private large language models. arXiv preprint arXiv:2312.17449 . First Author et al.:Preprint submitted to ElsevierPage 14 of 21

work page arXiv 2023
[21]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Yao, S., Shinn, N., Razavi, P., Narasimhan, K., 2024.𝑡𝑎𝑢-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045

work page internal anchor Pith review arXiv 2024
[22]

React: Synergizing reasoning and acting in language models, in: The eleventh international conference on learning repre- sentations

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y., 2022. React: Synergizing reasoning and acting in language models, in: The eleventh international conference on learning repre- sentations

2022
[23]

Zhang, J., Lan, T., Zhu, M., Liu, Z., Hoang, T.Q., Kokane, S., Yao, W., Tan, J., Prabhakar, A., Chen, H., et al., 2025. xlam: A family of large action models to empower ai agent systems, in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Lon...

2025
[24]

Construct, align, and reason: Large on- tology models for enterprise knowledge management

Zhang, Y., Zhu, H., 2026. Construct, align, and reason: Large on- tology models for enterprise knowledge management. arXiv preprint arXiv:2602.00029

work page arXiv 2026
[25]

Node classification via semantic-structural attention-enhanced graph convolutional networks

Zhu, H., 2024. Node classification via semantic-structural attention-enhanced graph convolutional networks. arXiv preprint arXiv:2403.16033

work page arXiv 2024
[26]

Unifying ontology construction and semantic align- ment for deterministic enterprise reasoning at scale

Zhu, H., 2026. Unifying ontology construction and semantic align- ment for deterministic enterprise reasoning at scale

2026
[27]

Flexner: A flexible lstm-cnn stack framework for named entity recognition, in: CCF International ConferenceonNaturalLanguageProcessingandChineseComputing, Springer

Zhu, H., Hu, W., Zeng, Y., 2019. Flexner: A flexible lstm-cnn stack framework for named entity recognition, in: CCF International ConferenceonNaturalLanguageProcessingandChineseComputing, Springer. pp. 168–178

2019
[28]

Retracted: Pre-training graph autoencoder incorporating hierarchical topology knowledge

Zhu,H.,Li,Y.,Liu,L.,Tong,H.,Lin,Q.,Zhang,C.,2025. Retracted: Pre-training graph autoencoder incorporating hierarchical topology knowledge

2025
[29]

Pre-training languagemodelincorporatingdomain-specificheterogeneousknowl- edge into a unified representation

Zhu,H.,Peng,H.,Lyu,Z.,Hou,L.,Li,J.,Xiao,J.,2023. Pre-training languagemodelincorporatingdomain-specificheterogeneousknowl- edge into a unified representation. Expert Systems with Applications 215, 119369

2023
[30]

only cost-center L2 nodes are traversable for this user’s role

Zhu, H., Tiwari, P., Zhang, Y., Gupta, D., Alharbi, M., Nguyen, T.G., Dehdashti, S., 2022. Switchnet: A modular neural network for adaptive relation extraction. Computers and Electrical Engineering 104, 108445. A. Ontology Harness Engineering: The LOM Global Architecture LOM-actionisonecomponentofalargersystemwhose fullinstantiationissubjecttoongoingwork....

2022