HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

Jonas Petersen, Gian-Alessandro Lombardi, Riccardo Maggioni, Camilla Mazzoleni, Federico Martelli, Philipp Petersen (3) ((1) ETH Zurich; (2) Forgis; (3) University of Vienna)

arxiv: 2605.11130 · v4 · pith:UEMSOXGLnew · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-14 21:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: self-supervised learning · time series · event prediction · survival analysis · transformer · JEPA · horizon conditioning · multivariate forecasting

The pith

HEPA pretrains a causal Transformer on unlabeled time series by forecasting future representations at chosen horizons, then freezes the encoder to output accurate survival CDFs for rare events with far less labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a self-supervised Joint-Embedding Predictive Architecture conditioned on prediction horizons can extract useful temporal structure from unlabeled multivariate series. A causal Transformer encoder is trained so that a separate predictor can forecast its future representations rather than raw values, after which the encoder is frozen and a lightweight head is tuned to produce monotonic survival cumulative distribution functions for target events. This single fixed architecture and hyperparameter set is shown to exceed PatchTST, iTransformer, MAE, and Chronos-2 on at least ten of fourteen benchmarks spanning water contamination, cyberattacks, volatility shifts, and eight other event types across eleven domains while using an order of magnitude fewer tuned parameters and, on lifecycle data, an order of magnitude less labeled supervision.

Core claim

A causal Transformer encoder pretrained via horizon-conditioned JEPA learns representations whose future states are predictable from unlabeled sequences alone; freezing this encoder and finetuning only the attached predictor then yields monotonic survival CDFs over horizons that accurately locate critical events, delivering superior benchmark performance with fixed hyperparameters and drastically reduced labeled data.

What carries the argument

Horizon-conditioned JEPA pretraining in which the encoder must produce representations that a separate network can forecast at arbitrary future horizons, thereby capturing predictable dynamics without labels.
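The pretraining objective described above can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the toy `causal_encoder` (a running mean plus the last value), the linear `predictor`, the weight vector `w`, and the squared-error loss are hypothetical stand-ins, not the paper's Transformer encoder, predictor, or loss. The sketch only shows how (t, ∆t) pairs are swept and how the predicted representation is compared with the representation the encoder produces at t + ∆t.

```python
def causal_encoder(x, t):
    """Toy causal encoder (assumption): summarises x[0..t] only, never the future."""
    window = x[: t + 1]
    mean = sum(window) / len(window)
    return [mean, window[-1]]  # 2-dimensional representation h_t


def predictor(h, horizon, w):
    """Toy horizon-conditioned predictor (assumption): linear in h and the horizon."""
    return [w[0] * h[0] + w[1] * horizon, w[2] * h[1] + w[3] * horizon]


def jepa_loss(x, w, horizons):
    """Mean squared error between predicted and actual future representations,
    averaged over every valid (t, horizon) pair in the series."""
    total, count = 0.0, 0
    for t in range(len(x)):
        h_t = causal_encoder(x, t)
        for dt in horizons:
            if t + dt >= len(x):
                continue  # the target would lie beyond the end of the series
            target = causal_encoder(x, t + dt)  # used as a fixed target here
            pred = predictor(h_t, dt, w)
            total += sum((p - q) ** 2 for p, q in zip(pred, target))
            count += 1
    if count == 0:
        raise ValueError("no valid (t, horizon) pairs")
    return total / count


# Usage: on a linear ramp the future representation is an exact linear
# function of (h_t, horizon), so one weight setting drives the loss to zero.
ramp = [float(i) for i in range(20)]
good_w = [1.0, 0.5, 1.0, 1.0]    # uses the horizon
naive_w = [1.0, 0.0, 1.0, 0.0]   # ignores the horizon
print(jepa_loss(ramp, good_w, horizons=[1, 5]))
print(jepa_loss(ramp, naive_w, horizons=[1, 5]))
```

The comparison between the two weight settings is the point of the example: a predictor that ignores the horizon cannot match representations at different lead times, which is why the horizon is fed to the predictor as an input.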

If this is right

Rare critical events become predictable in domains where labeling is costly because the method relies primarily on abundant unlabeled series.
A single architecture and hyperparameter choice suffices for eleven distinct application domains without per-domain redesign.
Event prediction accuracy exceeds that of PatchTST, iTransformer, MAE, and Chronos-2 on at least ten of fourteen standard benchmarks.
Tuned-parameter count drops by roughly an order of magnitude relative to fully supervised competitors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pretraining pattern could be applied to other scarce-label sequential tasks such as medical arrhythmia forecasting or industrial fault detection.
Because the output is a full monotonic survival CDF rather than a single probability, downstream systems could directly incorporate calibrated horizon-specific risk thresholds.
Variable or learned horizon sampling during pretraining might further improve robustness across differing sampling rates or event timescales.
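To make the point about horizon-specific risk thresholds concrete, here is a minimal sketch of how per-horizon hazard rates could be composed into a monotonic CDF and then thresholded. It assumes the standard discrete-time construction, CDF_k = 1 − ∏_{j≤k}(1 − λ_j); the source only says that hazard rates are "composed into a survival CDF", so the exact composition used by the authors may differ. The function names, hazard values, and horizons are hypothetical.

```python
def survival_cdf(hazards):
    """Compose per-horizon hazards into a cumulative event probability.

    Assumes the discrete-time rule CDF_k = 1 - prod_{j<=k} (1 - hazard_j),
    which is non-decreasing whenever every hazard lies in [0, 1].
    """
    cdf, survive = [], 1.0
    for lam in hazards:
        if not 0.0 <= lam <= 1.0:
            raise ValueError("hazards must lie in [0, 1]")
        survive *= 1.0 - lam       # probability of no event so far
        cdf.append(1.0 - survive)  # probability of an event by this horizon
    return cdf


def first_horizon_over(cdf, horizons, threshold):
    """Return the smallest horizon whose cumulative risk reaches the threshold."""
    for horizon, p in zip(horizons, cdf):
        if p >= threshold:
            return horizon
    return None  # risk never reaches the threshold within the horizons given


# Usage with made-up hazards for three horizons.
horizons = [1, 5, 10]
cdf = survival_cdf([0.1, 0.2, 0.5])
print(cdf)
print(first_horizon_over(cdf, horizons, threshold=0.25))
```

Because the running survival probability can only shrink, monotonicity holds by construction and does not have to be enforced with a penalty term.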

Load-bearing premise

Representations learned by horizon-conditioned JEPA pretraining on unlabeled data will transfer effectively to accurate event-specific survival CDF prediction after the encoder is frozen, without domain-specific architectural changes or extensive hyperparameter search.

What would settle it

A controlled experiment on an additional rare-event dataset in which the frozen HEPA encoder produces lower event-prediction accuracy than a randomly initialized encoder of the same size or than the leading baselines would falsify the transfer claim.
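A comparison of this kind needs a single score per model. The Figure 1 caption names h-AUROC, described as horizon-averaged AUROC, so the sketch below computes AUROC per horizon and averages the results. The unweighted mean and the choice to skip horizons that contain only one class are assumptions, as are the function names and the toy scores; the source does not spell out these details.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Returns None if one class is absent."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def h_auroc(scores_by_horizon, labels_by_horizon):
    """Unweighted mean of per-horizon AUROC (assumed definition),
    skipping horizons where AUROC is undefined."""
    values = [auroc(s, y) for s, y in zip(scores_by_horizon, labels_by_horizon)]
    values = [v for v in values if v is not None]
    if not values:
        raise ValueError("AUROC is undefined at every horizon")
    return sum(values) / len(values)


# Usage with made-up scores: perfect ranking at the first horizon,
# chance-level ranking at the second.
scores = [[0.9, 0.1, 0.8, 0.2], [0.6, 0.7, 0.4, 0.3]]
labels = [[1, 0, 1, 0], [1, 0, 1, 0]]
print(h_auroc(scores, labels))
```

Running the same function on a frozen pretrained encoder and on a randomly initialised encoder of the same size would give the two numbers the falsification test above asks for.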

Figures

Figures reproduced from arXiv: 2605.11130 by Jonas Petersen, Gian-Alessandro Lombardi, Riccardo Maggioni, Camilla Mazzoleni, Federico Martelli, and Philipp Petersen.

**Figure 1.** One label-efficient architecture, domain- and event-agnostic. (a) h-AUROC (↑; horizon-averaged AUROC) across 14 benchmarks in 11 domains. HEPA wins on 10 out of 14 at full labels; at 10% labels (open circles) it retains ≥92% of full-label performance on lifecycle datasets. (b) Predicted probability surfaces p(t, ∆t) for turbofan degradation (top) and cardiac arrhythmia (bottom).

**Figure 2.** HEPA architecture. Both stages sweep over all (t, ∆t) pairs per episode. Stage 1: The causal encoder f_θ maps x_{≤t} to h_t; the predictor g_ϕ(h_t, ∆t) predicts future representations via a self-supervised JEPA objective. Stage 2: Encoder frozen; the predictor produces K horizon-specific hazard rates λ_∆t composed into a survival CDF (cumulative distribution function) p(t, ∆t). Collapse Ĥ = H∗ = const is prevented…

**Figure 3.** Self-supervised pretraining learns task-relevant structure. (a) Pretraining loss ε vs. downstream h-AUROC (↑) at fixed checkpoints across three domains (C-MAPSS-3: ρ=−0.67; MBA: ρ=−0.64; SMAP: ρ=−0.49; 3 seeds, error bars ±1 std). Within a dataset, L, C_η, and I(H⋆; E_{t+∆t}) are constant, so the bound's monotone prediction is directly testable. ⋆ marks the converged-best snapshot; ε scales differ across data…

**Figure 4.** Evaluation framework. (a) The probability surface p(t, ∆t) on a representative C-MAPSS-1 engine (lifetime 174 cycles) unifies all event-prediction metrics as lossy projections. The colour scale matches Fig. 1b. RMSE requires converting the survival curve to a point estimate τ̂ = Σ_∆t ∆t · P(event at ∆t); this projection is sensitive to calibration (section J). PA-F1 thresholds p(t, 1) at the smallest horizon…

**Figure 5.** Predictor outputs in latent space (C-MAPSS-1). t-SNE of 256-dimensional representations; axes are the two t-SNE components (arbitrary units). Blue: encoder output h_t. Light to dark red: predicted representations at horizons k = 10, 50, 100. Outputs shift progressively with k.
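The Figure 4 caption mentions converting the survival curve to a point estimate as a sum over horizons of ∆t · P(event at ∆t), and notes that this projection is sensitive to calibration. The sketch below is one reading of that conversion, under the assumption that P(event at ∆t) is the increment of the CDF between consecutive horizons. The function name and the numbers are hypothetical.

```python
def point_estimate(horizons, cdf):
    """Expected time-to-event: sum of horizon * P(event at that horizon).

    P(event at horizon k) is taken as cdf[k] - cdf[k - 1] (assumption).
    Probability mass beyond the last horizon is not counted, so a CDF that
    stops short of 1 pulls the estimate down.
    """
    prev, tau = 0.0, 0.0
    for dt, p in zip(horizons, cdf):
        tau += dt * (p - prev)
        prev = p
    return tau


# Usage with made-up curves over three horizons.
horizons = [1, 2, 3]
calibrated = [0.2, 0.5, 1.0]        # all probability mass falls within the horizons
underconfident = [0.18, 0.45, 0.9]  # same shape, scaled by 0.9
print(point_estimate(horizons, calibrated))
print(point_estimate(horizons, underconfident))
```

The two curves rank every horizon identically, so a ranking metric such as AUROC would not distinguish them, yet their point estimates differ. That is one way to see why an RMSE computed from the point estimate depends on calibration.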

Original abstract

Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is scarce because such events are rare and costly to annotate. We introduce HEPA (Horizon-conditioned Event Predictive Architecture), built on two key principles. First, a causal Transformer encoder is pretrained via a Joint-Embedding Predictive Architecture (JEPA): a horizon-conditioned predictor learns to forecast future representations rather than future values, forcing the encoder to capture predictable temporal dynamics from unlabeled data alone. Second, we freeze the encoder and finetune only the predictor toward the target event, producing a monotonic survival cumulative distribution function (CDF) over horizons. With fixed architecture and optimiser hyperparameters across all benchmarks, HEPA handles water contamination, cyberattack detection, volatility regimes, and eight further event types across 11 domains, exceeding leading time-series architectures including PatchTST, iTransformer, MAE, and Chronos-2 on at least 10 of 14 benchmarks, with an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HEPA's fixed-hyperparameter JEPA pretraining plus frozen-encoder survival-CDF predictor works across 14 benchmarks with little tuning, but the baseline comparisons need verification on whether they used the same fixed settings.

The new piece is the horizon-conditioned predictor inside a JEPA pretraining loop that forces the encoder to learn predictable dynamics from unlabeled data, followed by freezing the encoder and training only the head to emit monotonic survival CDFs for event timing. That combination is not in the cited baselines like PatchTST or Chronos-2, and the fixed-architecture, fixed-optimizer setup across 11 domains is a practical step forward for rare-event settings where labels are scarce.

Referee Report

2 major / 1 minor

Summary. The paper introduces HEPA, a self-supervised architecture for rare-event prediction in multivariate time series. It pretrains a causal Transformer encoder via horizon-conditioned Joint-Embedding Predictive Architecture (JEPA) on unlabeled data, where a predictor forecasts future representations rather than raw values. The encoder is then frozen and only the predictor is finetuned to produce a monotonic survival CDF over prediction horizons. With a single fixed architecture and optimizer hyperparameter set across all tasks, the method is evaluated on 14 benchmarks spanning 11 domains (water contamination, cyberattack detection, volatility regimes, and eight additional event types) and is reported to outperform PatchTST, iTransformer, MAE, and Chronos-2 on at least 10 benchmarks while using an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.

Significance. If the performance claims hold under matched evaluation conditions, HEPA would provide a practical, parameter-efficient route to event prediction in label-scarce regimes by transferring representations learned from unlabeled data. The attempt to hold architecture and optimizer hyperparameters fixed across heterogeneous domains is a notable strength that, if substantiated, would strengthen evidence of architectural robustness rather than tuning artifacts. The reduction in required labeled data on lifecycle tasks could be impactful in domains where annotations are expensive.

major comments (2)

[Abstract] Abstract: the headline claim that HEPA exceeds the listed baselines on ≥10/14 benchmarks 'with fixed architecture and optimiser hyperparameters across all benchmarks' and 'an order of magnitude fewer tuned parameters' is load-bearing for the central contribution, yet the manuscript supplies no explicit statement, table, or appendix confirming that PatchTST, iTransformer, MAE, and Chronos-2 were evaluated under the identical fixed-hyperparameter regime rather than their conventional per-benchmark tuning. Without this matched-condition evidence the reported gap cannot be unambiguously attributed to the HEPA design.
[§4–§5] Evaluation protocol (throughout §4–§5): the abstract asserts clear empirical superiority but the manuscript provides no details on statistical testing (significance levels, number of random seeds, variance across runs), benchmark construction, train/validation/test splits, or ablation studies isolating the contribution of the horizon-conditioned JEPA pretraining versus the survival-CDF head. These omissions prevent assessment of whether the gains are robust or sensitive to implementation choices.

minor comments (1)

[Abstract] The abstract refers to 'eight further event types' without enumerating them; a short list or reference to the benchmark table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for explicit confirmation of evaluation conditions and greater transparency in the experimental protocol. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additions.

Referee: [Abstract] Abstract: the headline claim that HEPA exceeds the listed baselines on ≥10/14 benchmarks 'with fixed architecture and optimiser hyperparameters across all benchmarks' and 'an order of magnitude fewer tuned parameters' is load-bearing for the central contribution, yet the manuscript supplies no explicit statement, table, or appendix confirming that PatchTST, iTransformer, MAE, and Chronos-2 were evaluated under the identical fixed-hyperparameter regime rather than their conventional per-benchmark tuning. Without this matched-condition evidence the reported gap cannot be unambiguously attributed to the HEPA design.

Authors: We agree that the current manuscript lacks an explicit statement confirming the hyperparameter regime for the baselines. In the experiments, the same fixed architecture and optimizer hyperparameters were applied uniformly to HEPA and all baselines (PatchTST, iTransformer, MAE, Chronos-2) to enable direct comparison. We will revise the abstract to include a brief clarifying clause and add a new table in the appendix that lists the exact hyperparameter values used for every method, explicitly stating that they were held constant across all 14 benchmarks. This will make the matched-condition evidence unambiguous and allow readers to attribute performance differences to the HEPA design. (Revision: yes)
Referee: [§4–§5] Evaluation protocol (throughout §4–§5): the abstract asserts clear empirical superiority but the manuscript provides no details on statistical testing (significance levels, number of random seeds, variance across runs), benchmark construction, train/validation/test splits, or ablation studies isolating the contribution of the horizon-conditioned JEPA pretraining versus the survival-CDF head. These omissions prevent assessment of whether the gains are robust or sensitive to implementation choices.

Authors: We acknowledge the absence of these details in the current version. We will expand Sections 4 and 5 with the following additions: (i) statistical testing details including the use of 5 random seeds, reported standard deviations, and paired t-tests with a p<0.05 significance threshold; (ii) explicit descriptions of benchmark construction, data splits (train/validation/test ratios), and preprocessing steps; and (iii) new ablation studies that isolate the horizon-conditioned JEPA pretraining (by comparing against a non-pretrained encoder) and the survival-CDF head (by comparing against a direct regression head). These revisions will allow readers to evaluate the robustness of the reported gains. (Revision: yes)
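For readers unfamiliar with the test named in this response, the following is a minimal sketch of a paired t-test across 5 seeds using only the Python standard library. The per-seed scores are invented purely for illustration and are not results from the paper; the function name is hypothetical. The critical value 2.776 is the standard two-sided 5% threshold for 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched runs.

    a[i] and b[i] must come from the same seed, so that the test operates
    on per-seed differences rather than on two independent samples.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-seed scores for two methods (illustrative only).
method_a = [0.80, 0.82, 0.81, 0.83, 0.79]
method_b = [0.75, 0.78, 0.77, 0.76, 0.74]
t, df = paired_t_statistic(method_a, method_b)

T_CRIT_DF4 = 2.776  # two-sided critical value at p < 0.05 for df = 4
print(t, df, abs(t) > T_CRIT_DF4)
```

With only 5 seeds the test has 4 degrees of freedom, so the threshold is considerably higher than the large-sample value of about 1.96.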

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's core chain consists of (1) self-supervised JEPA pretraining on unlabeled data only, where a horizon-conditioned predictor learns to match future representations (independent of any downstream event labels), followed by (2) freezing the encoder and finetuning solely the predictor head on labeled survival targets. No equation reduces a reported performance metric to a quantity defined by fitting on the target task itself, and no load-bearing step invokes a self-citation, uniqueness theorem, or ansatz imported from prior author work. The fixed-hyperparameter claim across benchmarks is an empirical protocol, not a definitional reduction. This is the standard non-circular self-supervised transfer pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the standard self-supervised learning assumption that representation prediction on unlabeled data yields transferable temporal features, plus the architectural choice of a causal Transformer and monotonic CDF head. No new physical entities are postulated.

axioms (1)

Domain assumption: A causal Transformer encoder pretrained via horizon-conditioned representation prediction will capture generalizable temporal dynamics from unlabeled multivariate time series.
Role: Invoked to justify the JEPA pretraining stage as sufficient for downstream event prediction.

invented entities (1)

Horizon-conditioned predictor (no independent evidence)
Purpose: To forecast future representations at variable horizons and produce a monotonic survival CDF for the target event after encoder freezing.
Role: New component introduced to convert the pretrained encoder into an event-timing model.

pith-pipeline@v0.9.0 · 5538 in / 1482 out tokens · 55427 ms · 2026-05-14T21:03:06.915737+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "a horizon-conditioned predictor learns to forecast future representations rather than future values... producing a monotonic survival cumulative distribution function (CDF) over horizons"

IndisputableMonolith/Foundation/AbsoluteFloorClosure · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Proposition 1 (Event-Information Retention) ... I(H_t; E_{t+Δt}) ≥ I(H∗; E_{t+Δt}) − C_η L² ε"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.