arxiv: 2604.22673 · v1 · submitted 2026-04-24 · 💻 cs.SE · cs.SC

Inferring Equivalence Classes from Legacy Undocumented Embedded Binaries for ISO 26262-Compliant Testing

Marco De Luca, Domenico Francesco De Angelis, Domenico Amalfitano, Pasquale Cimmino, Anna Rita Fasolino

Pith reviewed 2026-05-08 11:18 UTC · model grok-4.3

keywords: equivalence class partitioning, binary analysis, embedded firmware, symbolic execution, ISO 26262, legacy software testing, control flow reconstruction, test design

The pith

Control-flow reconstruction plus guided symbolic execution can infer equivalence classes directly from undocumented embedded binaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a binary-level technique to perform equivalence class partitioning on legacy firmware that lacks source code or specifications. It reconstructs control flow for each function and uses guided symbolic execution to identify groups of paths that produce identical observable outputs such as return values and output parameters. The resulting classes support the systematic test design required by safety standards like ISO 26262. An optional step converts the classes into human-readable form, and an industrial automotive study found that the automatically derived classes closely matched what expert engineers expected while also aiding comprehension.

Core claim

The methodology infers output-oriented equivalence classes by analyzing individual functions through control-flow reconstruction and guided symbolic execution, grouping execution paths according to indistinguishable observable behavior including return values and output parameters. An optional post-processing step produces human-readable representations to support comprehension and documentation. Evaluation in an industrial automotive context shows strong alignment with expert expectations and positive perception of readability and usefulness for function understanding and test design.

What carries the argument

The combination of control-flow reconstruction and guided symbolic execution that groups execution paths by identical observable outputs at the binary level.
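The grouping step can be illustrated with a minimal sketch. It assumes each explored path has already been reduced to a record holding its path condition, return value, and output-parameter values; the `Path` record, the `infer_equivalence_classes` helper, and the toy `clamp` paths are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One explored execution path (hypothetical record, not the paper's format)."""
    condition: str        # path condition over the inputs, kept as text here
    return_value: object  # value returned along this path
    out_params: tuple     # values written to output parameters, in order


def infer_equivalence_classes(paths):
    """Group paths whose observable outputs are identical."""
    classes = defaultdict(list)
    for p in paths:
        # The signature is the observable behaviour used to compare paths.
        signature = (p.return_value, p.out_params)
        classes[signature].append(p.condition)
    return dict(classes)


# Toy stand-in for the paths of a saturating function clamp(x) on [0, 100].
paths = [
    Path("x < 0", 0, ()),
    Path("0 <= x and x <= 100", "x", ()),
    Path("x > 100 and x <= 1000", 100, ()),
    Path("x > 1000", 100, ()),
]
classes = infer_equivalence_classes(paths)
for signature, conditions in classes.items():
    print(signature, "<-", " OR ".join(f"({c})" for c in conditions))
```

In this toy input the two paths that saturate at 100 collapse into a single class, so four paths yield three classes. A real pipeline would compare symbolic expressions through a solver rather than by literal equality.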

Load-bearing premise

Observable outputs captured at the binary level (return values and output parameters) are sufficient to define equivalence classes that match human expert judgment of functional behavior.

What would settle it

The claim would be undermined if independent experts manually partitioned the same set of functions into equivalence classes and the binary-inferred classes agreed with the expert partitions in fewer than 70 percent of cases.
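One way such a check could be scored is sketched below. It assumes that a "case" is one function and that agreement means the inferred partition of that function's paths is exactly the expert's partition, ignoring order. Both choices, the function names, and the path labels `p1`, `p2`, ... are illustrative assumptions rather than a protocol from the paper.

```python
def canonical(partition):
    """Order-insensitive form of a partition: a set of frozensets."""
    return frozenset(frozenset(block) for block in partition)


def agreement_rate(inferred, expert):
    """Fraction of shared functions whose inferred partition equals the expert one."""
    shared = inferred.keys() & expert.keys()
    if not shared:
        raise ValueError("no functions in common")
    matches = sum(canonical(inferred[f]) == canonical(expert[f]) for f in shared)
    return matches / len(shared)


# Hypothetical data: each function maps to a partition of its path labels.
inferred = {
    "clamp": [["p1"], ["p2"], ["p3", "p4"]],
    "crc8": [["p1", "p2"]],
    "scale_adc": [["p1"], ["p2", "p3"]],
}
expert = {
    "clamp": [["p3", "p4"], ["p1"], ["p2"]],  # same partition, different order
    "crc8": [["p1", "p2"]],
    "scale_adc": [["p1", "p2"], ["p3"]],      # experts split the paths differently
}
rate = agreement_rate(inferred, expert)
THRESHOLD = 0.70  # the 70 percent bar stated above
print(f"agreement = {rate:.2f}, below threshold: {rate < THRESHOLD}")
```

Exact-match scoring is strict. A study could instead use a partial-overlap measure, which would change what the 70 percent bar means.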

Figures

Figures reproduced from arXiv: 2604.22673 by Anna Rita Fasolino, Domenico Amalfitano, Domenico Francesco De Angelis, Marco De Luca, Pasquale Cimmino.

**Figure 1.** Pipeline diagram; only its labels survive extraction: map file (textual), ELF enriched by DWARF, map file parser, call-graph retrieval, metrics, CFG extraction, debug info extraction, grouping into clusters, angr framework, JSON file, cluster of function CFGs.

**Figure 2.** Phase 2 overview. Clusters are processed in ascending order of call depth and, in case of ties, by increasing number of accessed global variables. This scheduling reduces symbolic execution overhead, as analyzing shallower functions first enables the construction of reusable method summaries [36, 37], which are subsequently reused when exploring deeper call chains. A method summary captures the interface-…
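The ordering described in the Figure 2 caption amounts to a two-key sort. The sketch below assumes each cluster carries a precomputed call depth and a count of accessed global variables; the `Cluster` record, the `schedule` helper, and the cluster names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster record; field names are not from the paper."""
    name: str
    call_depth: int  # depth of the cluster in the call graph
    n_globals: int   # number of global variables it accesses


def schedule(clusters):
    """Ascending call depth first, then ascending global-variable count."""
    return sorted(clusters, key=lambda c: (c.call_depth, c.n_globals))


clusters = [
    Cluster("diag_handler", 2, 1),
    Cluster("crc8", 0, 0),
    Cluster("read_sensor", 1, 3),
    Cluster("scale_adc", 1, 1),
]
order = [c.name for c in schedule(clusters)]
print(order)
```

Processing in this order means every callee at a shallower depth has been analyzed before its callers, which is what lets summaries be reused.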

**Figure 3.** Code, CFG, and symbolic path conditions of …

**Figure 4.** Code, CFG, and symbolic path conditions of …

**Figure 5.** Source code, CFG, and symbolic constraints for …

**Figure 6.** Likert-scale rating distribution for Q1, Q2, Q3.

**Figure 7.** Likert-scale rating distribution for Q5, Q8, Q9.

Original abstract

Equivalence class partitioning is a well-established test design technique mandated by safety standards such as ISO 26262 for systematic testing of safety software. In industrial practice, however, its application to legacy undocumented embedded firmware is often hindered by incomplete or outdated functional specifications. This paper proposes a binary-level methodology for inferring output-oriented equivalence classes directly from compiled firmware, without relying on source-level annotations or external documentation. The approach combines control-flow reconstruction and guided symbolic execution to analyze individual functions and group execution paths according to indistinguishable observable behavior, including return values and output parameters. An optional post-processing step produces human-readable representations to support comprehension and documentation. The methodology is evaluated in an industrial automotive context through a practitioner-based study assessing correctness and interpretability. Results indicate strong alignment with expert expectations and a positive perception of readability and usefulness for supporting function understanding and test design. These findings demonstrate the feasibility and practical relevance of binary-level equivalence class inference for systematic testing of legacy undocumented safety-embedded software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper sketches a binary analysis pipeline to infer output-based equivalence classes from legacy embedded firmware for ISO 26262 testing, but the evaluation stays qualitative and the method risks missing hardware side-effects.

The paper's core idea is to take undocumented compiled firmware, rebuild its control flow, and run guided symbolic execution to cluster paths that produce the same return values and output parameters. An extra step turns those clusters into readable descriptions for testers. This directly targets the gap in legacy automotive code where specs are missing or outdated, which matters for regulated safety testing.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a binary-level methodology for inferring output-oriented equivalence classes from legacy undocumented embedded firmware. It combines control-flow reconstruction and guided symbolic execution to analyze individual functions and group execution paths by indistinguishable observable behavior (return values and output parameters). An optional post-processing step generates human-readable representations. The approach is evaluated via a practitioner-based industrial study in an automotive context, claiming strong alignment with expert expectations and positive perceptions of readability and usefulness for supporting ISO 26262-compliant test design and function comprehension.

Significance. If the central claim holds, the work would address a practical gap in applying mandated equivalence class partitioning to legacy safety-critical embedded systems lacking documentation. The practitioner study provides direct evidence of industrial relevance and interpretability, which is a strength for a methodology paper in software engineering for safety standards. However, the absence of quantitative metrics or detailed validation protocols in the reported results limits the assessed impact.

major comments (2)

[Evaluation section] The abstract and study description claim 'strong alignment with expert expectations' without providing quantitative metrics (e.g., agreement percentages, inter-rater reliability, or error rates), details on how equivalence classes were validated against expert judgments, or the number of participants/functions examined. This is load-bearing for the central claim that the inferred classes support ISO 26262-compliant testing.

[Methodology (control-flow and symbolic execution description)] Equivalence classes are defined solely by return values and designated output parameters. In legacy automotive firmware, functional behavior frequently depends on writes to memory-mapped peripherals, DMA buffers, or global state that influence hardware behavior; these are invisible to the analysis unless explicitly modeled as outputs. The paper does not demonstrate or discuss how such side-effects are captured or why they can be safely ignored, undermining the claim that grouped paths exhibit indistinguishable observable system-level behavior.

minor comments (1)

[Abstract] The abstract refers to 'positive perception of readability and usefulness' but omits study design details such as participant count, task instructions, or how readability was measured, which would improve clarity and allow readers to assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our methodology and evaluation. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Evaluation section] The abstract and study description claim 'strong alignment with expert expectations' without providing quantitative metrics (e.g., agreement percentages, inter-rater reliability, or error rates), details on how equivalence classes were validated against expert judgments, or the number of participants/functions examined. This is load-bearing for the central claim that the inferred classes support ISO 26262-compliant testing.

Authors: We agree that the evaluation section would benefit from greater transparency regarding the practitioner study design. The study was qualitative in nature, relying on expert review and feedback from automotive practitioners to assess alignment, readability, and usefulness rather than statistical measures. To address this, we will revise the evaluation section to explicitly report the number of participants, the number of functions examined, and a detailed description of the validation protocol (including how experts compared inferred classes against their expectations through structured reviews and discussions). We will also qualify the claim of 'strong alignment' to reflect the qualitative basis without implying quantitative validation, ensuring the central claim is appropriately supported by the reported evidence.

Revision: yes

Referee: [Methodology (control-flow and symbolic execution description)] Equivalence classes are defined solely by return values and designated output parameters. In legacy automotive firmware, functional behavior frequently depends on writes to memory-mapped peripherals, DMA buffers, or global state that influence hardware behavior; these are invisible to the analysis unless explicitly modeled as outputs. The paper does not demonstrate or discuss how such side-effects are captured or why they can be safely ignored, undermining the claim that grouped paths exhibit indistinguishable observable system-level behavior.

Authors: The methodology intentionally focuses on output-oriented equivalence classes derived from return values and explicitly designated output parameters, as these represent the observable interface for many functions in the target firmware and align with standard test design practices under ISO 26262. However, we acknowledge that side effects to memory-mapped peripherals, DMA, and global state are relevant in embedded automotive contexts. The current analysis treats such effects as out of scope unless they propagate to the designated outputs; no explicit modeling of hardware state is performed. We will revise the methodology section to clearly state this scope limitation, explain the rationale for focusing on designated outputs, and discuss potential extensions (such as user-specified memory regions) for future work. This ensures the claims about indistinguishable observable behavior are appropriately bounded.

Revision: partial
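The scope limitation debated in this exchange can be made concrete with a small sketch. It assumes each path records its return value and a map of memory writes, and it treats user-specified address ranges as extra observable outputs, in the spirit of the extension the rebuttal mentions. The `group_paths` helper, the record layout, and the addresses are hypothetical, not the authors' design.

```python
from collections import defaultdict


def group_paths(paths, watched_regions=()):
    """Group paths by return value plus writes that fall in watched regions.

    watched_regions is a sequence of (low, high) address ranges, high exclusive.
    With no regions, only the return value distinguishes paths.
    """
    classes = defaultdict(list)
    for p in paths:
        watched = tuple(
            (addr, val)
            for addr, val in sorted(p["writes"].items())
            if any(lo <= addr < hi for lo, hi in watched_regions)
        )
        classes[(p["ret"], watched)].append(p["cond"])
    return dict(classes)


# Two paths return 0 but write different values to a peripheral register.
paths = [
    {"cond": "mode == 0", "ret": 0, "writes": {0x4000_0010: 0x01}},
    {"cond": "mode == 1", "ret": 0, "writes": {0x4000_0010: 0x02}},
    {"cond": "mode > 1", "ret": 1, "writes": {}},
]
interface_only = group_paths(paths)
with_mmio = group_paths(paths, watched_regions=[(0x4000_0000, 0x4000_1000)])
print(len(interface_only), len(with_mmio))
```

With interface-only outputs the first two paths merge into one class. Once the peripheral range is watched they split, which is the kind of distinction the referee argues can matter at system level.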

Circularity Check

0 steps flagged

No circularity: methodology proposal with independent empirical evaluation

The paper describes a binary analysis methodology that combines control-flow reconstruction and guided symbolic execution to partition functions into equivalence classes based on observable outputs. No equations, fitted parameters, or derivation steps are present that reduce claims to self-referential inputs. The central results come from a separate practitioner-based industrial study assessing alignment with expert judgment, which is external to the method definition itself. No load-bearing self-citations or ansatzes imported from prior author work are invoked to justify the approach.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from binary analysis rather than new fitted parameters or invented entities; no free parameters are introduced.

axioms (2)

Domain assumption: Control-flow reconstruction from binaries accurately recovers function-level structure and paths for the target embedded firmware.
Invoked when the approach is described as combining control-flow reconstruction with symbolic execution on individual functions.

Domain assumption: Observable outputs (return values and output parameters) suffice to determine functional equivalence for testing purposes.
Core to grouping paths by indistinguishable observable behavior.

Reference graph

Works this paper leans on

37 extracted references · 13 canonical work pages

[1] Elvira Albert, Puri Arenas, Miguel Gómez-Zamalloa, and Jose Miguel Rojas. 2014. Test Case Generation by Symbolic Execution: Basic Concepts, a CLP-Based Instance, and Actor-Based Concurrency. Springer International Publishing, Cham, 263–309. doi:10.1007/978-3-319-07317-0_7

[2] Elvira Albert, María García de la Banda, Miguel Gómez-Zamalloa, José Miguel Rojas, and Peter Stuckey. 2013. A CLP heap solver for test case generation. Theory and Practice of Logic Programming 13, 4–5 (2013), 721–735. doi:10.1017/S1471068413000458

[3] Roberto Amadini, Mak Andrlon, Graeme Gange, Peter Schachte, Harald Søndergaard, and Peter J. Stuckey. 2019. Constraint Programming for Dynamic Symbolic Execution of JavaScript. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research, Louis-Martin Rousseau and Kostas Stergiou (Eds.). Springer International Publishing, ...

[4] Glenn Ammons, Rastislav Bodík, and James R. Larus. 2002. Mining specifications. SIGPLAN Not. 37, 1 (Jan. 2002), 4–16. doi:10.1145/565816.503275

[5] Tod Tracy Amon and Timothy James Loffredo. 2020. Creating Human Readable Path Constraints from Symbolic Execution. Technical Report. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

[6] Saswat Anand, Edmund K. Burke, Tsong Yueh Chen, John Clark, Myra B. Cohen, Wolfgang Grieskamp, Mark Harman, Mary Jean Harrold, Phil McMinn, Antonia Bertolino, J. Jenny Li, and Hong Zhu. 2013. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software 86, 8 (2013), 1978–2001. doi:10.1016/j.jss.2013.02.061

[7] Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. 2008. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, Vol. 8. 209–224

[8] Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. 2011. S2E: A platform for in-vivo multi-path analysis of software systems. ACM SIGPLAN Notices 46, 3 (2011), 265–278

[9] International Electrotechnical Commission. 2010. Functional safety of electrical/electronic/programmable electronic safety-related systems

[10] DWARF Debugging Information Format Committee. 2017. DWARF Debugging Information Format Version 5. DWARF Committee. https://dwarfstd.org/doc/DWARF5.pdf Available under the GNU Free Documentation License, Version 1.3

[11] Yiming Fan and Meng Wang. 2024. Specification mining based on the ordering points to identify the clustering structure clustering algorithm and model checking. Algorithms 17, 1 (2024), 28

[12] Vahid Garousi, Michael Felderer, Çağrı Murat Karapıçak, and Uğur Yılmaz. 2018. Testing embedded software: A survey of the literature. Information and Software Technology 104 (2018), 14–45. doi:10.1016/j.infsof.2018.06.016

[13] Miguel Gómez-Zamalloa, Elvira Albert, and Germán Puebla. 2010. Test case generation for object-oriented imperative languages in CLP. Theory and Practice of Logic Programming 10, 4–6 (2010), 659–674. doi:10.1017/S1471068410000347

[14] Gábor Horváth, Réka Kovács, and Zoltán Porkoláb. 2024. Scaling Symbolic Execution to Large Software Systems. ArXiv abs/2408.01909 (2024). https://api.semanticscholar.org/CorpusID:271709984

[15] Wen-Ling Huang and Jan Peleska. 2016. Complete model-based equivalence class testing. Int. J. Softw. Tools Technol. Transf. 18, 3 (2016), 265–283. doi:10.1007/s10009-014-0356-8

[16] Felix Hübner, Wen-ling Huang, and Jan Peleska. 2015. Experimental Evaluation of a Novel Equivalence Class Partition Testing Strategy. In Tests and Proofs (TAP 2015) (Lecture Notes in Computer Science, Vol. 9154). Springer, 155–172. doi:10.1007/978-3-319-21215-9_10

[17] ISO. 2018. ISO 26262 – Road vehicles – Functional safety – Part 6: Product development at the software level

[18] ISO. 2018. ISO 26262 – Road vehicles – Functional safety – Part 8: Supporting processes

[19] ISO/TC 22/SC 32. 2018. Road vehicles – Functional safety – Part 1: Vocabulary. Standard ISO 26262-1:2018 to ISO 26262-12:2018. International Organization for Standardization, Geneva, Switzerland. https://www.iso.org/standard/68383.html

[20] Joxan Jaffar, Jorge A Navas, and Andrew E Santosa. 2011. Unbounded symbolic execution for program verification. In International Conference on Runtime Verification. Springer, 396–411

[21] Anastasis Keliris and Michail Maniatakos. 2018. ICSREF: A framework for automated reverse engineering of industrial control systems binaries. arXiv preprint arXiv:1812.03478 (2018)

[22] Taegyu Kim, Aolin Ding, Sriharsha Etigowni, Pengfei Sun, Jizhou Chen, Luis Garcia, Saman Zonouz, Dongyan Xu, and Dave Tian. 2022. Reverse engineering and retrofitting robotic aerial vehicle control firmware using dispatch. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services. 69–83

[23] David Lo, Siau Cheng Khoo, Chao Liu, and Jiawei Han. 2011. Specification mining: A concise introduction. CRC Press, 1–27. Publisher copyright: © 2011 by Taylor & Francis Group, LLC

[24] Srinivas Malladi, G. Ramakrishna, K. Rao, and E Babu. 2016. Analysis of Legacy System in Software Application Development: A Comparative Survey. International Journal of Electrical and Computer Engineering (IJECE) 6 (02 2016), 292–297. doi:10.11591/ijece.v6i1.8367

[25] Bertrand Meyer. 2024. A formal definition of loop unrolling with applications to test coverage. arXiv preprint arXiv:2403.08923 (2024)

[26] Glenford J Myers, Tom Badgett, Todd M Thomas, and Corey Sandler. 2004. The art of software testing. Vol. 2. Wiley Online Library

[27] QA Systems. 2020. Automating Requirements-Based Testing for ISO 26262. https://www.qa-systems.com/wp-content/uploads/2020/12/automating-requirements-based-testing-for-iso-26262.pdf

[28] Abdullah Qasem, Paria Shirani, Mourad Debbabi, Lingyu Wang, Bernard Lebel, and Basile L. Agba. 2018. Automatic Vulnerability Detection in Embedded Device Firmware and Binary Code: Survey and Layered Taxonomies. ACM Computing Surveys – Extended Version 1, 1 (2018), 1–40. doi:10.1145/1122445.1122456

[29] Yuya Sasaki, Hironori Washizaki, Jialong Li, Nobukazu Yoshioka, Naoyasu Ubayashi, and Yoshiaki Fukazawa. 2025. Landscape and Taxonomy of Prompt Engineering Patterns in Software Engineering. IT Professional 27, 1 (2025), 41–49. doi:10.1109/MITP.2024.3525458

[30] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Audrey Dutcher, Jessie Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. 2016. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. (2016)

[31] Pengfei Sun, Luis Garcia, and Saman Zonouz. 2019. Tell me more than just assembly! reversing cyber-physical execution semantics of embedded iot controller software binaries. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 349–361

[32] Tool Interface Standards (TIS) Committee. 1995. Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification. Tool Interface Standards (TIS) Committee. https://refspecs.linuxfoundation.org/elf/elf.pdf Accessed: 2026-01-20

[33] Meet Udeshi, Prashanth Krishnamurthy, Hammond Pearce, Ramesh Karri, and Farshad Khorrami. 2024. REMaQE: Reverse Engineering Math Equations from Executables. ACM Trans. Cyber-Phys. Syst. 8, 4, Article 43 (Nov. 2024), 25 pages. doi:10.1145/3699674

[34] Nicolaas Weideman, Virginia K Felkner, Wei-Cheng Wu, Jonathan May, Christophe Hauser, and Luis Garcia. 2021. Perfume: Programmatic extraction and refinement for usability of mathematical expression. In Proceedings of the 2021 Research on offensive and defensive techniques in the Context of Man At The End (MATE) Attacks. 59–69

[35] Hendrik Winkelmann, Jan C. Dageförde, and Herbert Kuchen. 2021. Constraint-Logic Object-Oriented Programming with Free Arrays. In Functional and Constraint Logic Programming, Michael Hanus and Claudio Sacerdoti Coen (Eds.). Springer International Publishing, Cham, 129–144

[36] Qiuping Yi, Junye Wen, and Guowei Yang. 2020. Summary-guided incremental symbolic execution. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings. 310–311

[37] Yicheng Zeng, Jiaqian Peng, Jiami Lin, Rongrong Xi, and Hongsong Zhu. 2024. Symerge: Replacing Calls in Under-Constrained Symbolic Execution and Find Vulnerabilities. In Security and Privacy in Communication Networks, Saed Alrabaee, Kim-Kwang Raymond Choo, Ernesto Damiani, and Robert H. Deng (Eds.). Springer Nature Switzerland, Cham, 376–399