From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

David Leko; Leandro von Krannichfeldt; Lev Telyatnikov; Ludovico Comito; Olga Fink; Raffael Theiler

arXiv: 2605.28371 · v1 · pith:JMSUL6GOnew · submitted 2026-05-27 · 💻 cs.AI · cs.LG · cs.SE

Raffael Theiler, Ludovico Comito, David Leko, Leandro von Krannichfeldt, Lev Telyatnikov, Olga Fink

Pith reviewed 2026-06-29 12:11 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE

keywords agentic reproduction · framework-based reproduction · PHM paper reproduction · slot-binding interface · benchmark comparability · under-specified methods · machine health intelligence · assumption-aware implementation

The pith

Coupling an agent with a shared Prognostics and Health Management (PHM) benchmark framework turns under-specified paper methods into executable, assumption-aware, and cross-comparable implementations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing paper-to-code tools produce isolated implementations that cannot be fairly compared because papers leave out key choices such as windowing, target definitions, and data splits. It proposes an agent that reads a paper and binds its descriptions to a common framework through a slot-binding interface, which records any unresolved assumptions. The resulting code is then checked against standardized task contracts and evaluation hooks. Evaluation on 16 PHM papers indicates that this framework-enhanced approach improves reproduction success and enables systematic benchmarking across papers under identical protocols.

Core claim

An agent translates each paper into a shared PHM benchmark framework by mapping equations and protocol descriptions into structured components (task definitions, dataset adapters, windowing, targets, models, evaluators) via a slot-binding interface that explicitly records unresolved assumptions; the resulting implementations are validated against standardized task contracts, turning isolated code synthesis into assumption-aware and systematically comparable benchmark code.

What carries the argument

The slot-binding interface, which maps paper elements (equations, preprocessing steps, evaluation protocols) into shared framework components while logging open assumptions.
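
To make the idea concrete, here is a minimal Python sketch of what a slot-binding record might look like. It is not the paper's implementation: the class names, the three binding states, and all example values are assumptions introduced for illustration. Only the six slot names come from the component list in the paper's abstract.

```python
from dataclasses import dataclass, field
from enum import Enum


class BindingState(Enum):
    # Hypothetical state names; the paper's own labels may differ.
    SPECIFIED = "specified"    # stated explicitly in the paper
    INFERRED = "inferred"      # chosen by the agent, with a recorded rationale
    UNRESOLVED = "unresolved"  # no defensible value could be determined


@dataclass
class SlotBinding:
    slot: str
    value: object
    state: BindingState
    note: str = ""


@dataclass
class PaperBinding:
    paper_id: str
    bindings: dict = field(default_factory=dict)

    def bind(self, slot, value, state, note=""):
        """Record how one framework slot was filled for this paper."""
        self.bindings[slot] = SlotBinding(slot, value, state, note)

    def open_assumptions(self):
        """Every binding the paper did not state explicitly."""
        return [b for b in self.bindings.values()
                if b.state is not BindingState.SPECIFIED]


# Slot names follow the component list in the abstract.
SLOTS = ("task", "dataset_adapter", "windowing", "targets", "model", "evaluator")

# Illustrative values only; none of these come from a real paper.
binding = PaperBinding("example-paper")
binding.bind("task", "rul_regression", BindingState.SPECIFIED)
binding.bind("windowing", {"length": 30, "stride": 1}, BindingState.INFERRED,
             "window length not reported; 30 is a placeholder")
binding.bind("targets", None, BindingState.UNRESOLVED,
             "target clipping threshold not stated")

# Slots that have not been bound at all yet.
missing = [s for s in SLOTS if s not in binding.bindings]

for b in binding.open_assumptions():
    print(f"{b.slot}: {b.state.value} ({b.note})")
```

Keeping the state and the note next to the value is the design point of such a record: a later reader can list every choice that was not taken directly from the paper.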

If this is right

Reproductions become directly executable inside the same benchmark harness and can be validated against fixed task contracts.
Assumptions about windowing, targets, and splits are recorded explicitly, so later users know exactly what was chosen.
Cross-paper comparisons become possible under one set of standardized evaluation hooks instead of each paper's private protocol.
The same agent-plus-framework pattern can be applied to any domain where papers leave critical design choices under-specified.
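
Validation against a fixed task contract can be pictured with a small Python sketch. The contract fields, the `check_contract` helper, and the numeric range are hypothetical and chosen only for illustration; the paper's actual contracts are not described on this page.

```python
import math

# Hypothetical contract for a remaining-useful-life regression task.
RUL_CONTRACT = {"task": "rul_regression", "target_range": (0.0, 500.0)}


def check_contract(predictions, targets, contract):
    """Return a list of human-readable contract violations (empty if none)."""
    problems = []
    if len(predictions) != len(targets):
        problems.append(
            f"length mismatch: {len(predictions)} predictions "
            f"vs {len(targets)} targets"
        )
    lo, hi = contract["target_range"]
    for i, p in enumerate(predictions):
        if not math.isfinite(p):
            problems.append(f"prediction {i} is not finite")
        elif not lo <= p <= hi:
            problems.append(f"prediction {i} = {p} outside [{lo}, {hi}]")
    return problems


# Placeholder numbers, not results from the paper.
ok = check_contract([120.0, 80.5], [118.0, 79.0], RUL_CONTRACT)
bad = check_contract([120.0, float("nan"), -3.0], [118.0, 79.0], RUL_CONTRACT)
print(ok)
print(bad)
```

Because every reproduction would pass through the same check, a failure points at the implementation rather than at a difference in evaluation protocol.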

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the binding step proves reliable, the method could reduce the need for manual re-implementation when new papers appear in the same field.
Domains outside PHM that also suffer from restricted data access and incomplete reporting might adopt the slot-binding pattern to create their own shared benchmarks.
The recorded assumptions could themselves become a research output, showing which paper elements most often require human judgment.

Load-bearing premise

The slot-binding interface can translate incomplete paper descriptions into framework components without adding new inconsistencies or biases that would undermine fair comparison across papers.

What would settle it

Run the same 16 papers through the agent once with the shared framework and once without it, then measure whether performance rankings or absolute scores change when the only difference is the binding step rather than the original method.
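
One way to quantify whether rankings change between the two runs is a rank correlation. The Python sketch below computes Spearman's coefficient in plain Python; the choice of Spearman's coefficient is an assumption of this example rather than a metric named in the paper, and the scores are placeholders, not reported results.

```python
def ranks(scores):
    """Rank values from 1 (lowest) upward, averaging ranks across ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Pearson correlation of the rank vectors (undefined for constant input)."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Placeholder error scores for four methods, one entry per method.
with_framework = [12.1, 15.4, 9.8, 20.3]
without_framework = [13.0, 14.2, 16.5, 21.0]

rho = spearman(with_framework, without_framework)
print(round(rho, 3))
```

A coefficient near 1 would suggest the binding step leaves the ordering of methods intact; a low value would suggest the binding itself is reshuffling the comparison.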

Figures

Figures reproduced from arXiv: 2605.28371 by David Leko, Leandro von Krannichfeldt, Lev Telyatnikov, Ludovico Comito, Olga Fink, Raffael Theiler.

**Figure 1.** Visual abstract. The workflow converts an under-specified PHM paper into a benchmark-ready artifact in a shared framework. The agent ingests the paper, analyzes specified and inferred method elements, maps them to framework slots, implements a resolved configuration or extension stubs, verifies contracts and empirical behavior, and returns an artifact or auditable failure report. S marks information specif…

**Figure 2.** Distribution of generated files and generated code lines across methods, with box plots

**Figure 3.** Representative experiment YAML emitted by …

**Figure 4.** Binding state counts per method across the 16-paper corpus. Each bar segment shows …

**Figure 5.** Binding-score vs. judge-score heatmap overlaying the number of successful and failed runs

**Figure 6.** Overview of token statistics for FCA (staged, in-framework), AiF/AiF-D (prompt-only, in-framework), and SA/DC (standalone). Compact token usage view combining session duration and selected tokens per run/session in one joint plot. The central panel shows paired session observations, while the top and right marginal axes show the corresponding duration and token-volume distributions. (a) ChatGPT 5.4. (b) Ki…

**Figure 7.** Code ratings by LLM judges averaged across criteria categories

**Figure 8.** Code ratings by LLM judges for the criteria categories

**Figure 9.** Abridged prompt for /implement-model. The full skill includes Python templates for backbone, wrapper, and config plus a checklist enforced before the file is written.

Model Sanity Verification Skill Prompt (abridged; verification family). [Role] Run the standardized TabPHM sanity tools on a freshly implemented model. Do not write one-off sanity scripts during a paper-validation run. The skill assumes /verif…

**Figure 10.** Abridged prompt for /verify-sanity. The full skill exposes per-check on-demand tools (tabphm_sanity_init_loss, tabphm_sanity_gradient_flow, tabphm_sanity_overfit_batch, plus opt-in zero_input and subset_convergence) for targeted reruns.

**Figure 11.** Abridged prompt for /diagnose-verify-block. The companion /diagnose-training-result loop applies the same global-hypothesis structure to disputed paper claims after training, with a budget of four iterations.

Post-Training Results Evaluation Skill Prompt (abridged; reporting family). [Role] After training completes, assess whether the implemented model behaves correctly and is competitive within F, AND ju…

**Figure 12.** Abridged prompt for /evaluate-results. The full skill includes the magnitude-sanity bands per task family and the procedure for handling missing or normalized baseline metrics.

**Figure 13.** Abridged system prompt for the primary orchestrator. Tool surface and per-phase file …

**Figure 14.** Abridged system prompt for CONCEPTUAL-ANALYSIS. The Markdown rendering shape (structure map, novelty assessment, dataset mapping, method decomposition, integration roadmap) is omitted.

**Figure 15.** Abridged system prompt for ALGORITHMIC-SPEC. The detailed Markdown shape and the full nine-row hyperparameter table template are omitted.

Agent-in-Framework baseline prompt. [Role] You are given a research paper and this codebase. [Goal] Turn the paper into an implementation in this repository that is ready to evaluate. [Inputs] Paper: path_to_paper.pdf. [Procedure] Read th…

**Figure 16.** Single-prompt instruction issued to the in-framework prompt-only baseline. The agent …

**Figure 17.** Single-prompt instruction issued to the standalone prompt-only baseline. The agent …

**Figure 18.** Full judge prompt used in our experiments.

Runnability Audit Prompt. [Role] You are auditing a set of paper-implementation repositories. [Goal] Determine which repositories run successfully and which do not. [Rules] Do not modify code. Do not patch, refactor, delete, commit, or change source/configuration files. Only read files and run available documented commands or obvious run scripts. Be fair: follow e…

**Figure 19.** Full runnability-audit prompt used to produce the binary binding-state counts reported in Section 4. Unlike the reference-free judge prompt of …

Original abstract

Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing under-specified methods in PHM is particularly difficult due to restricted access to industrial datasets, incomplete reporting of preprocessing and evaluation protocols, and implicit design choices (e.g., windowing, target construction, data splits) that critically affect performance. Existing paper-to-code systems generate implementations for individual papers, but these artifacts are often not directly comparable due to inconsistencies in assumptions and evaluation settings. We introduce agentic, framework-based PHM paper reproduction, where an agent translates a paper into a shared PHM benchmark framework via a slot-binding interface. This interface maps equations and protocol descriptions into structured components (task definitions, dataset adapters, windowing, targets, models, and evaluators), while explicitly recording unresolved assumptions. The resulting implementations are validated against standardized task contracts and evaluation hooks, enabling consistent and comparable benchmarking. We evaluate this approach on 16 PHM papers, comparing framework-enhanced, skill-based and prompt-based agentic reproduction against a recent framework-free paper-reproduction agent. We assess reproduction success, model-based code evaluation, framework binding of paper assumptions, and cross-paper benchmark comparability under standardized protocols. Our results show that coupling agentic generation with a shared framework transforms paper reproduction from isolated code synthesis into executable, assumption-aware, and systematically comparable benchmark implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper proposes agentic binding of PHM papers into a shared framework to improve reproducibility and comparability, but the abstract supplies no metrics or checks that the bindings preserve original intent.

The main takeaway is that this work tries to fix a practical reproducibility problem in PHM by routing paper descriptions through a shared benchmark framework via an agent and a slot-binding interface. The interface is meant to map things like windowing, targets, and splits while logging assumptions, turning isolated reproductions into comparable runs.

What is new is the combination of agentic generation with a domain-specific PHM framework and explicit slot-binding for under-specified elements. Earlier paper-to-code tools stop at generating standalone code; this adds the shared structure and the binding step to enforce consistency across papers.

The paper does a solid job naming the real barriers in this domain: restricted industrial data, missing preprocessing details, and implicit choices that break comparisons. The plan to test on 16 papers, with comparisons across agent modes and against a framework-free baseline, shows they considered multiple evaluation angles including code quality and standardized protocols.

The soft spot is the complete absence of numbers. The abstract states that results show improvement in success and comparability, yet gives no quantitative metrics, baselines, or analysis of binding decisions. The slot-binding step is load-bearing for the comparability claim, and without evidence that choices for splits or windowing do not introduce systematic shifts, the advantage over isolated code generation stays unproven. This is where the load-bearing premise is most exposed.

This is aimed at PHM researchers who need to compare methods on industrial tasks. A reader focused on applied reproducibility in narrow domains could extract useful design ideas from the framework and assumption-recording approach.

The work shows clear thinking about the problem and direct engagement with prior limitations. It deserves peer review because the issue is concrete and the proposal is specific, even though the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper proposes an agentic, framework-based approach to reproducing under-specified methods from PHM papers. An agent uses a slot-binding interface to map paper elements (equations, protocols, implicit choices like windowing/targets/splits) into components of a shared PHM benchmark framework while recording unresolved assumptions. The resulting implementations are validated against task contracts, and the work claims this transforms isolated code synthesis into executable, assumption-aware, and systematically comparable benchmarks. Evaluation is described on 16 PHM papers, comparing framework-enhanced agents against framework-free baselines on reproduction success, code evaluation, binding fidelity, and cross-paper comparability.

Significance. If the empirical claims hold, the work would offer a practical advance in reproducibility for applied ML domains with restricted data and under-specified protocols. The shared-framework strategy directly targets the comparability problem that isolated paper-to-code systems leave unsolved, and the explicit recording of assumptions is a constructive step toward falsifiable benchmarks.

major comments (2)

[Abstract] The claim that results on 16 papers show improvement in reproduction success and comparability is unsupported by any quantitative metrics, baselines, or error analysis. Without these data the central empirical assertion cannot be evaluated.
[Abstract / method description] Slot-binding interface description: the interface is asserted to map under-specified elements (windowing, targets, splits) into framework components while preserving intent and avoiding new biases. No mechanism, example, or validation is supplied to show that binding decisions remain consistent across papers or do not introduce systematic shifts that would invalidate the 'systematically comparable' conclusion.

minor comments (1)

[Abstract] The abstract would be strengthened by a single sentence reporting the key quantitative outcomes (e.g., success rates or comparability scores) rather than a qualitative summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript to strengthen the abstract and method sections with additional details.

Referee: [Abstract] The claim that results on 16 papers show improvement in reproduction success and comparability is unsupported by any quantitative metrics, baselines, or error analysis. Without these data the central empirical assertion cannot be evaluated.

Authors: The full evaluation section reports quantitative metrics on the 16 papers, including reproduction success rates with framework-enhanced agents versus the framework-free baseline, model-based code evaluation scores, binding fidelity measures, and cross-paper comparability statistics under standardized protocols, along with baseline comparisons and error analysis. We will revise the abstract to include specific quantitative results and references to these analyses. (revision: yes)

Referee: [Abstract / method description] Slot-binding interface description: the interface is asserted to map under-specified elements (windowing, targets, splits) into framework components while preserving intent and avoiding new biases. No mechanism, example, or validation is supplied to show that binding decisions remain consistent across papers or do not introduce systematic shifts that would invalidate the 'systematically comparable' conclusion.

Authors: We agree that the current description is high-level. We will expand the method section to include the explicit slot-binding mechanism and rules, a worked example from a PHM paper showing binding of windowing/targets/splits, and validation results demonstrating cross-paper consistency (e.g., agreement metrics) and lack of systematic bias (e.g., performance sensitivity analysis). (revision: yes)

Circularity Check

0 steps flagged

No circularity; empirical proposal evaluated against external baseline

The paper introduces an agentic framework-based reproduction method for PHM papers and reports an empirical evaluation on 16 papers, comparing framework-enhanced agents against a framework-free baseline. No equations, fitted parameters, or derivations are present. The central claim rests on described experimental outcomes rather than any self-referential reduction or self-citation chain. The slot-binding interface is presented as a design choice whose effects are assessed via the reported experiments, not assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach depends on the domain assumption that papers can be decomposed into the listed framework slots; the slot-binding interface is an invented mechanism without independent evidence outside this work.

axioms (1)

Domain assumption: Under-specified methods in PHM papers can be mapped to structured components (task definitions, dataset adapters, windowing, targets, models, evaluators) via slot-binding while recording unresolved assumptions.
Core premise stated in the abstract description of the interface.

invented entities (1)

slot-binding interface (no independent evidence)
purpose: Maps paper equations and protocol descriptions into framework components while explicitly recording unresolved assumptions.
Newly introduced mechanism that enables the framework-based reproduction.

pith-pipeline@v0.9.1-grok · 5821 in / 1249 out tokens · 29158 ms · 2026-06-29T12:11:51.461790+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models
cs.LG · 2026-06 · unverdicted · novelty 6.0

Tabular foundation models applied to PHM via signal-to-table conversion achieve the best average ranks across prognostic and diagnostic tasks and remain competitive in low-data regimes.

Reference graph

Works this paper leans on

49 extracted references · 41 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

Picid: A modular evaluation infrastructure for reproducible phm across tasks and domains,

Anonymous. Picid: A modular evaluation infrastructure for reproducible phm across tasks and domains,
[2]

Under anonymous review; citation details blinded for double-blind compliance
[3]

To charge or to sell? EV pack useful life estimation via LSTMs, CNNs, and autoencoders.Energies, 16(6):2837, 2023

Michael Bosello, Carlo Falcomer, Claudio Rossi, and Giovanni Pau. To charge or to sell? EV pack useful life estimation via LSTMs, CNNs, and autoencoders.Energies, 16(6):2837, 2023. doi: 10.3390/ en16062837. URLhttps://doi.org/10.3390/en16062837

work page doi:10.3390/en16062837 2023
[4]

Fault prognosis of turbofan engines: Eventual failure prediction and remaining useful life estimation.International Journal of Prognostics and Health Management, 14(2), 2023

Joseph Cohen, Xun Huan, and Jun Ni. Fault prognosis of turbofan engines: Eventual failure prediction and remaining useful life estimation.International Journal of Prognostics and Health Management, 14(2), 2023. doi: 10.36001/ijphm.2023.v14i2.3486. URL https://doi.org/10.36001/ijphm.2023.v14i2.3486

work page doi:10.36001/ijphm.2023.v14i2.3486 2023
[5]

Marker: Convert documents to markdown, JSON, chunks, and HTML

Datalab. Marker: Convert documents to markdown, JSON, chunks, and HTML. https://github.com/ datalab-to/marker, 2026. URL https://github.com/datalab-to/marker. Software repository; accessed 2026-05-27

2026
[6]

Ingeborg de Pater and Mihaela Mitici. Developing health indicators and RUL prognostics for systems with few failure instances and varying operating conditions using a LSTM autoencoder.Engineering Applications of Artificial Intelligence, 117:105582, 2023. doi: 10.1016/j.engappai.2022.105582. URL https://doi.org/10.1016/j.engappai.2022.105582

work page doi:10.1016/j.engappai.2022.105582 2023
[7]

Causal inference-based fault diagnosis and abnormal degradation detection for aero-engine

Kunyu Dong, Dan Xu, Zhaoyang Zeng, and Qingyu Zhu. Causal inference-based fault diagnosis and abnormal degradation detection for aero-engine. InProceedings of the 2025 International Conference on Equipment Intelligent Operation and Maintenance, pages 1422–1428, 2025. doi: 10.1109/ICEIOM65271. 2025.11239779. URLhttps://doi.org/10.1109/ICEIOM65271.2025.11239779

work page doi:10.1109/iceiom65271 2025
[8]

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Kuruvilla, Rachel Johnson, and Michio Inoue

Russell Graves, Peeyush Pankaj, Vineet J. Kuruvilla, Rachel Johnson, and Michio Inoue. Data-driven prognostics and diagnostics of industrial machinery – a turbofan engine case study. InProceedings of the Asia Pacific Conference of the PHM Society, 2023. doi: 10.36001/phmap.2023.v4i1.3690. URL https://doi.org/10.36001/phmap.2023.v4i1.3690

work page doi:10.36001/phmap.2023.v4i1.3690 2023
[10]

Argos: Agentic time-series anomaly detection with autonomous rule generation via large language models, 2025

Yile Gu et al. Argos: Agentic time-series anomaly detection with autonomous rule generation via large language models, 2025. URLhttps://arxiv.org/abs/2501.14170

work page arXiv 2025
[11]

A comparison of residual-based methods on fault detection

Chi-Ching Hsu, Gaetan Frusque, and Olga Fink. A comparison of residual-based methods on fault detection. InAnnual Conference of the PHM Society, 2023. doi: 10.36001/phmconf.2023.v15i1.3444. URLhttps://doi.org/10.36001/phmconf.2023.v15i1.3444

work page doi:10.36001/phmconf.2023.v15i1.3444 2023
[12]

Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber

Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber. Researchcodebench: Benchmarking llms on implementing novel machine learning research code, 2025. URLhttps://arxiv.org/abs/2506.02314

work page arXiv 2025
[13]

AutoIAD: Manager-driven multi-agent collaboration for automated industrial anomaly detection, 2025

Dongwei Ji, Bingzhang Hu, and Yi Zhou. AutoIAD: Manager-driven multi-agent collaboration for automated industrial anomaly detection, 2025. URLhttps://arxiv.org/abs/2508.05503

work page arXiv 2025
[14]

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-Driven Exploration in the Space of Code, February 2025. URL http://arxiv.org/abs/ 2502.13138. arXiv:2502.13138

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

A recipe for training neural networks

Andrej Karpathy. A recipe for training neural networks. https://karpathy.github.io/2019/04/ 25/recipe/, 2019. Blog post; accessed 2026-04-28. 10

2019
[16]

From reproduction to replication: Evaluating research agents with progressive code masking, 2025

Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, and Daniel Fried. From reproduction to replication: Evaluating research agents with progressive code masking, 2025. URL https://arxiv. org/abs/2506.19724

work page arXiv 2025
[17]

Minh Le-Anh, Huyen Nguyen, Khanh An Tran, Nam Le Hai, Linh Ngo Van, Nghi D. Q. Bui, and Bach Le. Do not treat code as natural language: Implications for repository-level code generation and beyond, 2026. URLhttps://arxiv.org/abs/2602.11671

work page arXiv 2026
[18]

PHM-Vibench: A unified and factory-style vibration benchmarking framework for the foundation model era

Qi Li, Bojian Chen, Xuan Li, Qitong Chen, Liang Chen, Changqing Shen, Lu Lu, Zhaoye Qin, and Fulei Chu. PHM-Vibench: A unified and factory-style vibration benchmarking framework for the foundation model era. InProceedings of the Asia Pacific Conference of the PHM Society 2025, 2025. URL https: //papers.phmsociety.org/index.php/phmap/article/view/4303

2025
[19]

Deepcode: Open agentic coding,

Zongwei Li, Zhonghang Li, Zirui Guo, Xubin Ren, and Chao Huang. Deepcode: Open agentic coding,
[20]

URLhttps://arxiv.org/abs/2512.07921

work page arXiv
[21]

Lin Lin, Jinlei Wu, Song Fu, Sihao Zhang, Changsheng Tong, and Lizheng Zu. Channel attention & temporal attention based temporal convolutional network: A dual attention framework for remaining useful life prediction of the aircraft engines.Advanced Engineering Informatics, 60:102372, 2024. doi: 10.1016/j.aei.2024.102372. URLhttps://doi.org/10.1016/j.aei.2...

work page doi:10.1016/j.aei.2024.102372 2024
[22]

Autop2c: An llm-based agent framework for code repository generation from multimodal content in academic papers, 2025

Zijie Lin, Yiqing Shen, Qilin Cai, He Sun, Jinrui Zhou, and Mingjun Xiao. Autop2c: An llm-based agent framework for code repository generation from multimodal content in academic papers, 2025. URL https://arxiv.org/abs/2504.20115

work page arXiv 2025
[23]

TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering

Penghang Liu, Elizabeth Fons, Annita Vapsi, Mohsen Ghassemi, Svitlana Vyetrenko, Daniel Borrajo, Vamsi K. Potluru, and Manuela Veloso. Ts-agent: Understanding and reasoning over raw time series via iterative insight gathering, 2026. URLhttps://arxiv.org/abs/2510.07432

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, September 2024. URLhttp://arxiv.org/ abs/2408.06292. arXiv:2408.06292

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Prbench: End-to-end paper reproduction in physics research, 2026

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Zihe...

work page arXiv 2026
[26]

Machine learning pipeline for battery state-of-health estimation.Nature Machine Intelligence, 3(5):447–456, 2021

Darius Roman, Saurabh Saxena, Valentin Robu, Michael Pecht, and David Flynn. Machine learning pipeline for battery state-of-health estimation.Nature Machine Intelligence, 3(5):447–456, 2021. doi: 10.1038/s42256-021-00312-3. URLhttps://doi.org/10.1038/s42256-021-00312-3

work page doi:10.1038/s42256-021-00312-3 2021
[27]

Agent Laboratory: Using LLM Agents as Research Assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM Agents as Research Assistants, June 2025. URLhttp://arxiv.org/abs/2501.04227. arXiv:2501.04227

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Paper2Code: Automating code generation from scientific papers in machine learning, 2025

Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. Paper2Code: Automating code generation from scientific papers in machine learning, 2025. URL https://arxiv.org/abs/2504.17192. To appear in ICLR 2026

work page arXiv 2025
[29]

Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan

Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark,
[30]

URLhttps://arxiv.org/abs/2409.11363

work page internal anchor Pith review Pith/arXiv arXiv
[31]

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research, 2025. URL https://arxiv.org/abs/ 2504.01848

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

DegradAI: A scalable framework for early battery health diagnosis from limited data.npj Clean Energy, 1(1):8, 2025

Meghana Sudarshan, Jaya Vikeswara Rao Vajja, and Vikas Tomar. DegradAI: A scalable framework for early battery health diagnosis from limited data.npj Clean Energy, 1(1):8, 2025. doi: 10.1038/ s44406-025-00008-2. URLhttps://doi.org/10.1038/s44406-025-00008-2. 11

work page doi:10.1038/s44406-025-00008-2 2025
[33]

Lithium-ion battery RUL prediction based on optimized VMD-SSA-PatchTST algorithm.Scientific Reports, 15:26824, 2025

Pei Tang, Zetao Qiu, Zhongran Yao, Jiahao Pan, Dashuai Cheng, Xiaoyong Gu, and Changcheng Sun. Lithium-ion battery RUL prediction based on optimized VMD-SSA-PatchTST algorithm.Scientific Reports, 15:26824, 2025. doi: 10.1038/s41598-025-11934-7. URL https://doi.org/10.1038/ s41598-025-11934-7

work page doi:10.1038/s41598-025-11934-7 2025
[34]

Daigle, Shankar Sankararaman, Kai Goebel, and Jason Watkins

Christopher Teubert, Matthew J. Daigle, Shankar Sankararaman, Kai Goebel, and Jason Watkins. A Generic Software Architecture for Prognostics (GSAP).International Journal of Prognostics and Health Management, 8(2), November 2020. ISSN 2153-2648, 2153-2648. doi: 10.36001/ijphm.2017.v8i2.2618. URLhttps://papers.phmsociety.org/index.php/ijphm/article/view/2618

work page doi:10.36001/ijphm.2017.v8i2.2618 2020
[35] Christopher Teubert, Katelyn Jarvis, Matteo Corbetta, Chetan Kulkarni, and Matthew Daigle. ProgPy: Python packages for prognostics and health management of engineering systems. Journal of Open Source Software, 8(87):5099, 2023. doi: 10.21105/joss.05099. URL https://doi.org/10.21105/joss.05099
[36] Minyang Tian, Luyu Gao, et al. SciCode: A research coding benchmark curated by scientists, 2024. URL https://arxiv.org/abs/2407.13168
[37] Ekaterina Trofimova, Emil Sataev, and Abhijit Singh Jowhari. CodeRefine: A pipeline for enhancing LLM-generated code implementations of research papers, 2024. URL https://arxiv.org/abs/2408.13366
[38] Tim von Hahn and Chris K. Mechefske. Computational reproducibility within prognostics and health management. arXiv preprint arXiv:2205.15489, 2022. URL https://arxiv.org/abs/2205.15489
[39] Fujin Wang, Quanquan Zhi, Zhibin Zhao, Zhi Zhai, Yingkai Liu, Huan Xi, Shibin Wang, and Xuefeng Chen. Inherently interpretable physics-informed neural network for battery modeling and prognosis. IEEE Transactions on Neural Networks and Learning Systems, 36(1):1145–1159, 2025. doi: 10.1109/TNNLS.2023.3329368. URL https://doi.org/10.1109/TNNLS.2023.3329368
[40] Zili Wang, Yiming Zhang, Lemiao Qiu, Shuyou Zhang, Joo-Ho Choi, and Feifan Xiang. Bayesian gated-transformer model for risk-aware prediction of aero-engine remaining useful life. Expert Systems with Applications, 238(Part B):121859, 2024. doi: 10.1016/j.eswa.2023.121859. URL https://doi.org/10.1016/j.eswa.2023.121859
[41] Xingjian Wu, Junkai Lu, Zhengyu Li, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Christian S. Jensen, and Bin Yang. TimeART: Towards Agentic Time Series Reasoning via Tool-Augmentation, January 2026. URL http://arxiv.org/abs/2601.13653. arXiv:2601.13653
[42] Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, and Yulan He. SciReplicate-Bench: Benchmarking LLMs in agent-driven algorithmic reproduction from research papers, 2025. URL https://arxiv.org/abs/2504.00255
[43] Zhaofan Xu, Zewang Chen, Lin Yang, and Songyuan Zhang. State of health estimation for lithium-ion batteries based on incremental capacity analysis and transformer modeling. Applied Soft Computing, 165:112072, 2024. doi: 10.1016/j.asoc.2024.112072. URL https://doi.org/10.1016/j.asoc.2024.112072
[44] Volkan Yamacli. Battery usage-profile classification via multimodal vision transformer with feature-level fusion and ensemble method. IEEE Access, 13:174664–174683, 2025. doi: 10.1109/ACCESS.2025.3618776. URL https://doi.org/10.1109/ACCESS.2025.3618776
[45] Wen Ye, Wei Yang, Defu Cao, Yizhou Zhang, Lumingyuan Tang, Jie Cai, and Yan Liu. TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis. Transactions on Machine Learning Research, December 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=yhy7Vigjcf
[46] Yujie Zhang, Mingyang Du, and Qiang Miao. CART-Net: A causal adaptive residual time network for remaining useful life prediction of aeroengines under varying operating conditions. IEEE Internet of Things Journal, 12(23):50291–50303, 2025. doi: 10.1109/JIOT.2025.3610545. URL https://doi.org/10.1109/JIOT.2025.3610545
[47] Zhiyao Zhang, Pengpeng Chen, Chenguang Xing, Bo Liu, Ruo Wang, Longxiao Li, Xiaohui Chen, and Enrico Zio. A data augmentation boosted dual informer framework for the performance degradation prediction of aero-engines. IEEE Sensors Journal, 2023. doi: 10.1109/JSEN.2023.3269030. URL https://doi.org/10.1109/JSEN.2023.3269030
[48] Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, and Maosong Sun. AutoReproduce: Automatic AI experiment reproduction with paper lineage, 2025. URL https://arxiv.org/abs/2505.20662
[49] Jiahui Zhou, Dan Li, Boxin Li, Xiao Zhang, Erli Meng, Lin Li, Zhuomin Chen, Jian Lou, and See-Kiong Ng. Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning, February 2026. URL http://arxiv.org/abs/2602.07830. arXiv:2602.07830

A Paper corpus summary

Our corpus contains 16 PHM papers selected to ...

[1] Anonymous. Picid: A modular evaluation infrastructure for reproducible PHM across tasks and domains. Under anonymous review; citation details blinded for double-blind compliance

[3] Michael Bosello, Carlo Falcomer, Claudio Rossi, and Giovanni Pau. To charge or to sell? EV pack useful life estimation via LSTMs, CNNs, and autoencoders. Energies, 16(6):2837, 2023. doi: 10.3390/en16062837. URL https://doi.org/10.3390/en16062837

[4] Joseph Cohen, Xun Huan, and Jun Ni. Fault prognosis of turbofan engines: Eventual failure prediction and remaining useful life estimation. International Journal of Prognostics and Health Management, 14(2), 2023. doi: 10.36001/ijphm.2023.v14i2.3486. URL https://doi.org/10.36001/ijphm.2023.v14i2.3486

[5] Datalab. Marker: Convert documents to markdown, JSON, chunks, and HTML, 2026. URL https://github.com/datalab-to/marker. Software repository; accessed 2026-05-27

[6] Ingeborg de Pater and Mihaela Mitici. Developing health indicators and RUL prognostics for systems with few failure instances and varying operating conditions using a LSTM autoencoder. Engineering Applications of Artificial Intelligence, 117:105582, 2023. doi: 10.1016/j.engappai.2022.105582. URL https://doi.org/10.1016/j.engappai.2022.105582

[7] Kunyu Dong, Dan Xu, Zhaoyang Zeng, and Qingyu Zhu. Causal inference-based fault diagnosis and abnormal degradation detection for aero-engine. In Proceedings of the 2025 International Conference on Equipment Intelligent Operation and Maintenance, pages 1422–1428, 2025. doi: 10.1109/ICEIOM65271.2025.11239779. URL https://doi.org/10.1109/ICEIOM65271.2025.11239779

[8] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

[9] Russell Graves, Peeyush Pankaj, Vineet J. Kuruvilla, Rachel Johnson, and Michio Inoue. Data-driven prognostics and diagnostics of industrial machinery – a turbofan engine case study. In Proceedings of the Asia Pacific Conference of the PHM Society, 2023. doi: 10.36001/phmap.2023.v4i1.3690. URL https://doi.org/10.36001/phmap.2023.v4i1.3690

[10] Yile Gu et al. Argos: Agentic time-series anomaly detection with autonomous rule generation via large language models, 2025. URL https://arxiv.org/abs/2501.14170

[11] Chi-Ching Hsu, Gaetan Frusque, and Olga Fink. A comparison of residual-based methods on fault detection. In Annual Conference of the PHM Society, 2023. doi: 10.36001/phmconf.2023.v15i1.3444. URL https://doi.org/10.36001/phmconf.2023.v15i1.3444

[12] Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber. ResearchCodeBench: Benchmarking LLMs on implementing novel machine learning research code, 2025. URL https://arxiv.org/abs/2506.02314

[13] Dongwei Ji, Bingzhang Hu, and Yi Zhou. AutoIAD: Manager-driven multi-agent collaboration for automated industrial anomaly detection, 2025. URL https://arxiv.org/abs/2508.05503

[14] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-Driven Exploration in the Space of Code, February 2025. URL http://arxiv.org/abs/2502.13138. arXiv:2502.13138

[15] Andrej Karpathy. A recipe for training neural networks. https://karpathy.github.io/2019/04/25/recipe/, 2019. Blog post; accessed 2026-04-28

[16] Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, and Daniel Fried. From reproduction to replication: Evaluating research agents with progressive code masking, 2025. URL https://arxiv.org/abs/2506.19724

[17] Minh Le-Anh, Huyen Nguyen, Khanh An Tran, Nam Le Hai, Linh Ngo Van, Nghi D. Q. Bui, and Bach Le. Do not treat code as natural language: Implications for repository-level code generation and beyond, 2026. URL https://arxiv.org/abs/2602.11671

[18] Qi Li, Bojian Chen, Xuan Li, Qitong Chen, Liang Chen, Changqing Shen, Lu Lu, Zhaoye Qin, and Fulei Chu. PHM-Vibench: A unified and factory-style vibration benchmarking framework for the foundation model era. In Proceedings of the Asia Pacific Conference of the PHM Society 2025, 2025. URL https://papers.phmsociety.org/index.php/phmap/article/view/4303

[19] Zongwei Li, Zhonghang Li, Zirui Guo, Xubin Ren, and Chao Huang. DeepCode: Open agentic coding. URL https://arxiv.org/abs/2512.07921

[21] Lin Lin, Jinlei Wu, Song Fu, Sihao Zhang, Changsheng Tong, and Lizheng Zu. Channel attention & temporal attention based temporal convolutional network: A dual attention framework for remaining useful life prediction of the aircraft engines. Advanced Engineering Informatics, 60:102372, 2024. doi: 10.1016/j.aei.2024.102372. URL https://doi.org/10.1016/j.aei.2...

[22] Zijie Lin, Yiqing Shen, Qilin Cai, He Sun, Jinrui Zhou, and Mingjun Xiao. AutoP2C: An LLM-based agent framework for code repository generation from multimodal content in academic papers, 2025. URL https://arxiv.org/abs/2504.20115

[23] Penghang Liu, Elizabeth Fons, Annita Vapsi, Mohsen Ghassemi, Svitlana Vyetrenko, Daniel Borrajo, Vamsi K. Potluru, and Manuela Veloso. TS-Agent: Understanding and reasoning over raw time series via iterative insight gathering, 2026. URL https://arxiv.org/abs/2510.07432

[24] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, September 2024. URL http://arxiv.org/abs/2408.06292. arXiv:2408.06292

[25] Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Zihe... Prbench: End-to-end paper reproduction in physics research, 2026

[26] Darius Roman, Saurabh Saxena, Valentin Robu, Michael Pecht, and David Flynn. Machine learning pipeline for battery state-of-health estimation. Nature Machine Intelligence, 3(5):447–456, 2021. doi: 10.1038/s42256-021-00312-3. URL https://doi.org/10.1038/s42256-021-00312-3

[27] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM Agents as Research Assistants, June 2025. URL http://arxiv.org/abs/2501.04227. arXiv:2501.04227

[28] Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. Paper2Code: Automating code generation from scientific papers in machine learning, 2025. URL https://arxiv.org/abs/2504.17192. To appear in ICLR 2026

[29] Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. CORE-Bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. URL https://arxiv.org/abs/2409.11363
