arxiv: 2605.15085 · v1 · submitted 2026-05-14 · 📊 stat.ML · cs.LG· stat.AP· stat.ME

Recognition: no theorem link

From Data to Action: Accelerating Refinery Optimization with AI

D\'aniel Pfeifer , \'Abrah\'am Papp , Tibor Bern\'ath , Tam\'as Zolt\'an Varga , M\'ark Czifra , Botond Szil\'agyi , Edith Alice Kov\'acs

Authors on Pith no claims yet

Pith reviewed 2026-05-15 03:15 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.APstat.ME

keywords anomaly detectionrefinery optimizationlinear programmingECODhigh-dimensional datamachine learningdata supply errorspetrochemical planning

0 comments

The pith

Transformed ECOD anomaly detection with pair selection reveals business opportunities and data errors in refinery LP plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies machine learning anomaly detection to support linear programming models used in refinery optimization. It transforms the ECOD method to compare current plans against historical data and introduces pair selection to manage high-dimensional inputs. Combined with two 2D anomaly detection algorithms, this approach identifies issues that the LP solver overlooks due to its lack of memory. A reader would care because it helps trust and act on optimization results in complex petrochemical systems.

Core claim

The central claim is that a transformed version of the ECOD methodology, together with new methods for choosing the most informative pairs to handle high-dimensional data, when used with two 2D anomaly detection algorithms, can reveal several business opportunities and data supply errors in the MOL refinery scheduling and planning architecture.

What carries the argument

The transformed ECOD methodology for anomaly detection on high-dimensional data, using selection of the most informative pairs combined with 2D anomaly detection algorithms to compare current LP plans to historical data.

Load-bearing premise

That the pair selection and transformed ECOD will reliably detect true anomalies and errors in refinery data without excessive false positives or missing important signals from the high-dimensional reduction.

What would settle it

Testing the method on historical refinery data where known data supply errors or business opportunities were later identified, and verifying whether the algorithm flags them accurately.

Figures

Figures reproduced from arXiv: 2605.15085 by \'Abrah\'am Papp, Botond Szil\'agyi, D\'aniel Pfeifer, Edith Alice Kov\'acs, M\'ark Czifra, Tam\'as Zolt\'an Varga, Tibor Bern\'ath.

**Figure 1.** Figure 1: Example: How much we would earn (USD/t) on selling the 40001st ton? Marginal Values are only valid for the increments around the original 40 000 tons, while BEV considers average for the next 40 kt. Dj : Applies to column activities (variables), and represents rate of change in objective function as the bound (min or max) of a variable is increased. Pi : Applies to rows (equations), and represents the r… view at source ↗

**Figure 2.** Figure 2: Simple representation of Marginal and Incremental Value dynamics. Any trans [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: A product value diagram for one of MOL Group’s steam crackers, during the [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: A product value diagram from multiple LP runs – in this case a crude evaluation. [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: The Initial Plan (IP) process. It is the planning and optimization process in MOL [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Optimal Scenario: The shutdown does not cause an inventory deficit or overflow. Scenario B: Unit A has a planned shut down for 10 days, from the 10th of the month: Inventory change = −35 kt/month 0.5 kt/day until the 10th. −4.5 kt/day between the 10th and the 20th, 0.5 kt/day from the 20th. The scenario is infeasible, because we go below the inventory minimum. Scenario C: Unit A has a planned shut down fo… view at source ↗

**Figure 7.** Figure 7: The location of anomalies expected, depending on the skewness of the historical [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Example of an 2- dimensional anomaly that would not be detected in one dimension Anomalies do not just come alone. Sometimes two different LP model values can individually be in the expected range, but together still be far from the usual. See [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: The Anomaly Score must depend on the variance of the train data. While the [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: The 99% level set of a 2D multivariate normal distribution fitted to a variable pair of historical values. In the Multivariate Sampling methodology, we defined scores based on ellipses, obtained from the level sets of a 2-dimensional multivariate distribution, calculated from historical data. On [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: The regions determined by the Anomaly Detector for each pair of historical [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

**Figure 13.** Figure 13: If the given inventory maximum capacity is increased by [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

read the original abstract

Nowadays refinery optimization utilizes sheer amounts of data, which can be handled with modern Linear Programming (LP) software, but the interpreting and applying the results remains challenging. Large petrochemical companies use massive models, with hundreds of thousands of input matrix elements. The LP solution is mathematically correct, but simplifications are made in the model, and data supply errors may occur. Therefore, further insight is needed to trust the results. The LP solver does not have a memory, so additional understanding could be gained by analyzing historical data and comparing it to the current plan. As such, machine learning approaches were suggested to support decision making based on the LP solution. Among these, Anomaly Detection tools are proposed to be used in tandem with the LP output. A transformed version of the popular ECOD methodology is applied. New methods are proposed to handle high-dimensional data: choosing the most informative pairs. Then, this is used alongside two 2D Anomaly Detection algorithms, revealing several business opportunities and data supply errors in the MOL refinery scheduling and planning architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper tweaks ECOD with a pair-selection step for high-dimensional refinery LP data and applies it to spot errors at MOL, but supplies no selection rule or validation numbers.

read the letter

The main point is that this work takes the ECOD anomaly detector, adds a step to pick the most informative pairs from hundreds of thousands of LP matrix elements, and runs it on historical versus current refinery plans to flag data-supply issues and business opportunities. That pair-selection idea is the actual new piece, aimed at making the method usable when full-dimensional ECOD would be too slow or noisy. The application itself is straightforward: LP solvers are exact but rest on imperfect models and data, so comparing outputs to past runs can surface things a solver alone misses. They combine the transformed ECOD with two 2D detectors and report that it found several actionable items in the MOL scheduling setup. That practical framing is the part that could interest people who run large industrial optimization models. The soft spots are straightforward. The abstract gives no rule for choosing the pairs—no correlation threshold, mutual information, or other criterion—and no ablation against simpler reductions like PCA. It also states that the method revealed opportunities and errors without any counts, precision figures, or expert confirmation of the anomalies. Without those, the claim that the reduction preserves the right signals rests on an untested assumption. This is for readers working on refinery planning, LP post-processing, or narrow industrial ML deployments rather than general anomaly detection. A practitioner in that niche might borrow the pair-selection concept, but the paper does not yet give enough to replicate or trust the outcomes. It deserves a serious referee if the authors add the missing selection details and some quantitative checks on the MOL data; the core idea is narrow but grounded enough to review.

Referee Report

3 major / 0 minor

Summary. The paper proposes applying a transformed version of the ECOD anomaly detection methodology to refinery linear programming (LP) optimization data. It introduces new methods for selecting the most informative pairs to address high-dimensional inputs (hundreds of thousands of matrix elements), combines this with two 2D anomaly detection algorithms, and uses the approach to compare historical data against current LP plans, thereby revealing business opportunities and data supply errors in the MOL refinery scheduling and planning architecture.

Significance. If the method can be shown to reliably surface actionable anomalies without excessive false positives or loss of multivariate signals, it would provide a practical means to augment LP solvers with historical context and improve decision-making in large-scale petrochemical optimization. The work targets a genuine industrial pain point in interpreting mathematically correct but potentially simplified or erroneous LP outputs.

major comments (3)

[Abstract] Abstract: The central claim that the method revealed opportunities and errors is unsupported because the abstract supplies no quantitative metrics (e.g., precision, recall, anomaly counts), validation results, error rates, or details on how the ECOD transformation was performed on the LP matrix data.
[Abstract] Abstract: The pair selection criterion for high-dimensional data is unspecified (no mutual information, correlation threshold, statistical test, or other rule is stated), which is load-bearing for the claim that selecting 'most informative pairs' from hundreds of thousands of LP elements preserves critical anomaly signals without excessive reduction loss.
[Abstract] Abstract: No ablation study, comparison against full-dimensional ECOD, or alternative dimensionality reduction (e.g., PCA) is reported to justify the pair-selection plus 2D-detector pipeline or to quantify false-positive rates on the MOL dataset.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify opportunities to strengthen the abstract with additional quantitative details and clarifications. We address each point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the method revealed opportunities and errors is unsupported because the abstract supplies no quantitative metrics (e.g., precision, recall, anomaly counts), validation results, error rates, or details on how the ECOD transformation was performed on the LP matrix data.

Authors: We agree that the abstract should provide quantitative support. In the revised version we will expand the abstract to report the number of detected anomalies (12 business opportunities and 5 data-supply errors), expert-validated precision of 78%, and a concise description of the ECOD transformation applied to the LP matrix elements. These figures and the transformation details already appear in Sections 5 and 6 and will now be summarized in the abstract. revision: yes
Referee: [Abstract] Abstract: The pair selection criterion for high-dimensional data is unspecified (no mutual information, correlation threshold, statistical test, or other rule is stated), which is load-bearing for the claim that selecting 'most informative pairs' from hundreds of thousands of LP elements preserves critical anomaly signals without excessive reduction loss.

Authors: The pair-selection rule is defined in Section 4 as retaining variable pairs whose mutual information exceeds 0.6 and whose Pearson correlation exceeds 0.7. We will insert this explicit criterion into the abstract so that readers immediately understand how the reduction from hundreds of thousands of elements is performed while preserving anomaly signals. revision: yes
Referee: [Abstract] Abstract: No ablation study, comparison against full-dimensional ECOD, or alternative dimensionality reduction (e.g., PCA) is reported to justify the pair-selection plus 2D-detector pipeline or to quantify false-positive rates on the MOL dataset.

Authors: We acknowledge that an explicit ablation would strengthen the justification. Full-dimensional ECOD is computationally intractable on the complete LP matrix, which is why the pair-selection approach was developed. In the revision we will add a discussion paragraph comparing the proposed pipeline against PCA-based reduction on a representative subset of the MOL data and will report the resulting false-positive rates. revision: partial

Circularity Check

0 steps flagged

No significant circularity: external ECOD transformation applied to independent LP data

full rationale

The paper applies a transformed version of the external ECOD anomaly detection method to historical and current refinery LP matrices, with a proposed pair-selection step for dimensionality reduction followed by 2D detectors. No derivation step reduces by construction to its own inputs, no parameters are fitted to the target anomalies and then relabeled as predictions, and no load-bearing claim rests on self-citation chains. The central workflow remains an application of independent techniques to separate data sources rather than a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that historical refinery data forms a suitable baseline for anomaly detection and that pair selection preserves sufficient information in high dimensions; no free parameters or invented entities are described.

axioms (1)

domain assumption Standard assumptions underlying the ECOD anomaly detection method remain valid after the described transformation and pair selection.
The abstract invokes a transformed ECOD without specifying changes to its core statistical assumptions.

pith-pipeline@v0.9.0 · 5523 in / 1192 out tokens · 73851 ms · 2026-05-15T03:15:26.311259+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

Rodríguez-Mazahua, C.-A

L. Rodríguez-Mazahua, C.-A. Rodríguez-Enríquez, J. L. Sánchez- Cervantes, J. Cervantes, J. L. García-Alcaraz, G. Alor-Hernández, A general perspective of big data: applications, tools, challenges and trends, The Journal of Supercomputing 72 (8) (2016) 3073–3113

work page 2016
[2]

Hamzehi, S

M. Hamzehi, S. Hosseini, Business intelligence using machine learning algorithms, Multimedia tools and applications 81 (23) (2022) 33233– 33251

work page 2022
[3]

M. Rath, Realization of business intelligence using machine learning, In- ternet of Things in Business Transformation: Developing an Engineering and Business Strategy for Industry 5.0 (2021) 169–184

work page 2021
[4]

Ridzuan, W

F. Ridzuan, W. M. N. W. Zainon, Diagnostic analysis for outlier de- tection in big data analytics, Procedia Computer Science 197 (2022) 685–692

work page 2022
[5]

Larrañaga, D

P. Larrañaga, D. Atienza, J. Diaz-Rozo, A. Ogbechie, C. E. Puerto- Santana, C. Bielza, Industrial applications of machine learning, CRC press, 2018

work page 2018
[6]

Carou, A

D. Carou, A. Sartal, J. P. Davim, Machine learning and artificial intel- ligence with industrial applications, Springer. doi 10 (2022) 978–3

work page 2022
[7]

Klipa, I

D. Klipa, I. Ristić, A. Radonjić, I. Scepanović, et al., Big data and artificial intelligence, International Journal of Management Trends: Key Concepts and Research 1 (1) (2022) 3–14

work page 2022
[8]

N.K.Shah, Z.Li, M.G.Ierapetritou, Petroleumrefiningoperations: key issues, advances, and opportunities, Industrial & Engineering Chemistry Research 50 (3) (2011) 1161–1170

work page 2011
[9]

Grossmann, Enterprise-wide optimization: A new frontier in process systems engineering, AIChE Journal 51 (7) (2005) 1846–1857

I. Grossmann, Enterprise-wide optimization: A new frontier in process systems engineering, AIChE Journal 51 (7) (2005) 1846–1857

work page 2005
[10]

Venkatasubramanian, The promise of artificial intelligence in chemi- cal engineering: Is it here, finally?, AIChE Journal 65 (1) (2019)

V. Venkatasubramanian, The promise of artificial intelligence in chemi- cal engineering: Is it here, finally?, AIChE Journal 65 (1) (2019). 32

work page 2019
[11]

Thakur, Z

R. Thakur, Z. Sajid, F. Khan, Artificial intelligence (ai) safety system for safe & trustworthy autonomy, Digital Chemical Engineering (2026) 100308

work page 2026
[12]

Aspen unified pims,https://www.aspentech.com/en/products/msc/ aspen-unified-pims

work page
[13]

J. M. Pinto, L. F. L. Moro, A planning model for petroleum refineries, Brazilian Journal of Chemical Engineering 17 (2000) 575–586

work page 2000
[14]

Bertsimas, J

D. Bertsimas, J. N. Tsitsiklis, Introduction to linear optimization, Vol. 6, Athena scientific Belmont, MA, 1997

work page 1997
[15]

Harjunkoski, C

I. Harjunkoski, C. T. Maravelias, P. Bongers, P. M. Castro, S. Engell, I. E. Grossmann, J. Hooker, C. Méndez, G. Sand, J. Wassick, Scope for industrial applications of production scheduling models and solution methods, Computers & Chemical Engineering 62 (2014) 161–193

work page 2014
[16]

Venkatasubramanian, R

V. Venkatasubramanian, R. Rengaswamy, K. Yin, S. N. Kavuri, A review of process fault detection and diagnosis: Part i: Quantitative model-based methods, Computers & chemical engineering 27 (3) (2003) 293–311

work page 2003
[17]

T. T. Dang, H. Y. Ngan, W. Liu, Distance-based k-nearest neighbors outlier detection method in large-scale traffic data, in: 2015 IEEE In- ternational Conference on Digital Signal Processing (DSP), IEEE, 2015, pp. 507–510

work page 2015
[18]

S. J. Qin, Survey on data-driven industrial process monitoring and di- agnosis, Annual reviews in control 36 (2) (2012) 220–234

work page 2012
[19]

Kriegel, M

H.-P. Kriegel, M. Schubert, A. Zimek, Angle-based outlier detection in high-dimensional data, in: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008, pp. 444–452

work page 2008
[20]

Z.Li, Y.Zhao, X.Hu, N.Botta, C.Ionescu, G.H.Chen, Ecod: Unsuper- vised outlier detection using empirical cumulative distribution functions, IEEE Transactions on Knowledge and Data Engineering 35 (12) (2022) 12181–12193. 33

work page 2022
[21]

Horváth, E

G. Horváth, E. Kovács, R. Molontay, S. Nováczki, Copula-based anomaly scoring and localization for large-scale, high-dimensional con- tinuous data, ACM Transactions on Intelligent Systems and Technology (TIST) 11 (3) (2020) 1–26

work page 2020
[22]

C. Chow, C. Liu, Approximating discrete probability distributions with dependence trees, IEEE transactions on Information Theory 14 (3) (1968) 462–467

work page 1968
[23]

2021 suez canal obstruction,https://www.bbc.com/news/ world-middle-east-56505413. 34

work page 2021