REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange
Pith reviewed 2026-06-28 04:51 UTC · model grok-4.3
The pith
The REStack dataset shows that reverse engineering discussions center on practical debugging and decompilation, while memory and firmware topics remain difficult to resolve.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A collection of over 12,000 RE posts can be reduced to 23 coherent topics that demonstrate RE practice is overwhelmingly practical and task-oriented, with debugging, decompilation, and system-level analysis dominating, while memory, firmware, and file-format analysis show elevated difficulty and unresolved rates.
What carries the argument
The REStack dataset, assembled by collecting posts from two Stack Exchange sites and then processed with LDA topic modeling whose hyperparameters were tuned by a genetic algorithm, followed by manual labeling of the resulting topics, grouping into six thematic categories, and enrichment with community-derived difficulty metadata.
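As a rough illustration of the tuning step described above, the sketch below runs a small genetic algorithm over LDA-style hyperparameters (number of topics, alpha, eta). The search ranges, the population settings, and the `toy_fitness` function are all assumptions made for illustration; the paper's actual GA configuration is not given in the text reviewed here. In a real pipeline, `fitness` would train an LDA model and return a quality score.

```python
import random

# Hypothetical search space; the paper's real ranges are not stated here.
BOUNDS = {"num_topics": (5, 50), "alpha": (0.01, 1.0), "eta": (0.01, 1.0)}


def random_individual(rng):
    """Sample one candidate hyperparameter setting within BOUNDS."""
    return {
        "num_topics": rng.randint(*BOUNDS["num_topics"]),
        "alpha": rng.uniform(*BOUNDS["alpha"]),
        "eta": rng.uniform(*BOUNDS["eta"]),
    }


def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in a}


def mutate(ind, rng, rate=0.2):
    """Resample each gene with probability `rate`."""
    fresh = random_individual(rng)
    return {k: (fresh[k] if rng.random() < rate else v) for k, v in ind.items()}


def genetic_search(fitness, generations=20, pop_size=12, elite=2, seed=0):
    """Maximise `fitness` over BOUNDS; returns (best_individual, best_score)."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = ranked[:elite] + children  # elitism keeps the best so far
    best = max(population, key=fitness)
    return best, fitness(best)


def toy_fitness(ind):
    """Stand-in objective (NOT a real LDA score) with an arbitrary peak."""
    return -(
        abs(ind["num_topics"] - 20) / 45
        + abs(ind["alpha"] - 0.1)
        + abs(ind["eta"] - 0.1)
    )


best, score = genetic_search(toy_fitness)
```

Because the two best individuals are carried over unchanged each generation, the best score can never get worse from one generation to the next, which makes the search easy to sanity-check.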
If this is right
- Empirical researchers gain a reusable corpus for measuring how RE challenges evolve over time.
- Educators obtain concrete topic lists for designing targeted training on high-difficulty areas.
- Developers of AI assistance tools receive labeled examples and difficulty signals for training and evaluation.
- Tool builders can prioritize support for memory and firmware analysis based on the observed unresolved rates.
Where Pith is reading between the lines
- The same collection method could be applied to other narrow software-engineering domains to produce comparable difficulty maps.
- Difficulty signals derived from unanswered rates could be tested as predictors of which RE questions would benefit most from automated help.
- The topic structure offers a starting point for defining benchmark tasks that future RE tools must handle.
Load-bearing premise
That the combination of LDA with genetic-algorithm tuning and subsequent manual labeling yields 23 topics that faithfully reflect the actual distribution of challenges in the collected posts.
What would settle it
A fresh run of the same modeling pipeline on the identical post collection that produces a markedly different set of topics or that shows low human agreement on the manual labels.
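One way to make this test concrete is to compare the top-word sets of topics from two runs. The sketch below matches each topic in one run to its most similar topic in the other by Jaccard similarity and reports the mean. The word lists are invented for illustration, and using Jaccard over top-N words is an assumption here, not a procedure confirmed by the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def run_stability(run_a, run_b):
    """Mean best-match Jaccard similarity of run_a's topics against run_b's.

    Each run is a list of topics; each topic is a list of its top-N words.
    """
    best_matches = [max(jaccard(t, u) for u in run_b) for t in run_a]
    return sum(best_matches) / len(best_matches)


# Invented top-word lists for two hypothetical runs (not taken from REStack).
run_a = [
    ["gdb", "breakpoint", "debugger", "step"],
    ["firmware", "flash", "uart", "dump"],
]
run_b = [
    ["firmware", "flash", "uart", "binwalk"],
    ["gdb", "breakpoint", "debugger", "trace"],
]

stability = run_stability(run_a, run_b)
```

A value near 1 would indicate that the two runs recover nearly the same topics; a value near 0 would indicate the "markedly different set of topics" outcome.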
Original abstract
Reverse engineering (RE) is a critical activity in software engineering and cybersecurity, supporting tasks such as malware analysis, vulnerability discovery, legacy system maintenance, and firmware inspection. Despite its importance, there is limited empirical understanding of the challenges, topics, and knowledge gaps faced by RE practitioners in real-world settings, and no publicly available dataset has systematically captured RE discussions from developer Q&A forums. In this paper, we present REStack, a large-scale dataset of RE discussions collected from Stack Overflow and the dedicated Reverse Engineering Stack Exchange site. The dataset comprises over 12,000 RE-related posts spanning more than 15 years. Using Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA)-based hyperparameter optimization, followed by manual topic labeling, we identify 23 semantically coherent RE topics grouped into six high-level thematic categories. The dataset is further enriched with metadata and difficulty indicators derived from community interaction signals, such as unanswered rates and response times. Our analysis reveals that RE discussions are predominantly practical and task-oriented, with strong emphasis on debugging, decompilation, and system-level analysis, while topics related to memory, firmware, and file format analysis exhibit high difficulty and unresolved rates. Beyond empirical characterization, REStack provides a reusable resource for empirical studies, educational research, and the development and evaluation of AI- and LLM-based developer assistance tools for RE. By releasing the dataset and accompanying scripts, this work aims to facilitate reproducible research and advance data-driven support for RE practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents REStack, a dataset of over 12,000 reverse engineering (RE) posts collected from Stack Overflow and the Reverse Engineering Stack Exchange site, spanning more than 15 years. It applies Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA) hyperparameter optimization, followed by manual labeling, to derive 23 topics grouped into six thematic categories. The work enriches the dataset with metadata and difficulty indicators (e.g., unanswered rates, response times), reports that RE discussions are predominantly practical and task-oriented (emphasizing debugging, decompilation, and system-level analysis) while memory, firmware, and file-format topics show high difficulty and unresolved rates, and releases the dataset plus scripts to support empirical studies, education, and AI/LLM tool development for RE.
Significance. If the topic characterization holds, the release of a large-scale, publicly available RE discussion dataset with derived difficulty signals constitutes a reusable resource for empirical SE research, educational analysis, and benchmarking of AI assistance tools. The paper's emphasis on reproducibility via released scripts and data is a clear strength.
major comments (3)
- [Methods] Methods (Topic Modeling and Labeling subsection): No coherence metrics (NPMI, C_V, or similar), topic stability across random seeds, or ablation on the GA objective function are reported for the 23 topics. This directly undermines the claim that the topics are 'semantically coherent' and the downstream grouping into six categories plus difficulty/unresolved-rate analysis.
- [Methods] Methods (Data Collection subsection): Exact search queries, filtering criteria, inclusion/exclusion rules, and post-selection validation steps used to obtain the 12,000+ posts are not specified. This affects both reproducibility of the core dataset and the representativeness of the analyzed RE discussions.
- [Methods] Methods (Labeling process): No inter-rater agreement statistics (e.g., Cohen's kappa or percentage agreement) are provided for the manual labeling of topics and assignment to the six high-level categories. Without this, the subjectivity concern in the post-hoc grouping cannot be assessed.
minor comments (2)
- [Abstract] Abstract and Introduction: The claim of 'no publicly available dataset' should be qualified with a brief comparison to any prior RE-related corpora (even if smaller or narrower) to strengthen novelty positioning.
- [Results] Results section: When reporting unresolved rates and response times per topic, include the raw counts or denominators alongside percentages for transparency.
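To illustrate the kind of per-topic reporting requested in the last minor comment, the sketch below computes an unanswered rate and a median time to first answer per topic while keeping the raw counts next to the percentage. The record fields (`topic`, `first_answer_hours`) and the sample values are hypothetical; the actual REStack schema is not described in the text reviewed here.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: first_answer_hours is None when a question has no answer.
posts = [
    {"topic": "debugging", "first_answer_hours": 2.0},
    {"topic": "debugging", "first_answer_hours": 5.0},
    {"topic": "debugging", "first_answer_hours": None},
    {"topic": "firmware", "first_answer_hours": 30.0},
    {"topic": "firmware", "first_answer_hours": None},
    {"topic": "firmware", "first_answer_hours": None},
]


def difficulty_by_topic(posts):
    """Return per-topic counts, unanswered rate, and median hours to first answer."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["topic"]].append(post["first_answer_hours"])

    report = {}
    for topic, hours in grouped.items():
        answered = [h for h in hours if h is not None]
        unanswered = len(hours) - len(answered)
        report[topic] = {
            "total": len(hours),
            "unanswered": unanswered,
            "unanswered_rate": unanswered / len(hours),
            # The median is undefined when no question in the topic was answered.
            "median_hours": median(answered) if answered else None,
        }
    return report


report = difficulty_by_topic(posts)
for topic, row in report.items():
    print(
        f"{topic}: {row['unanswered']}/{row['total']} unanswered "
        f"({row['unanswered_rate']:.0%}), median {row['median_hours']} h"
    )
```

Printing the numerator and denominator together with the rate lets a reader judge how much weight a percentage from a small topic can bear.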
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that enhance the manuscript's methodological transparency and reproducibility.
- Referee: [Methods] Methods (Topic Modeling and Labeling subsection): No coherence metrics (NPMI, C_V, or similar), topic stability across random seeds, or ablation on the GA objective function are reported for the 23 topics. This directly undermines the claim that the topics are 'semantically coherent' and the downstream grouping into six categories plus difficulty/unresolved-rate analysis.
  Authors: We agree that quantitative validation metrics would strengthen the presentation. In the revised manuscript we will report NPMI and C_V coherence scores for the final 23-topic model. We will also include a stability analysis by re-running LDA across multiple random seeds and reporting average pairwise Jaccard similarity of topic-word distributions. For the GA component we will specify the objective function (perplexity) and note that no ablation was performed due to computational cost; if space allows we will add a brief sensitivity check on key GA parameters.
  Revision: yes
- Referee: [Methods] Methods (Data Collection subsection): Exact search queries, filtering criteria, inclusion/exclusion rules, and post-selection validation steps used to obtain the 12,000+ posts are not specified. This affects both reproducibility of the core dataset and the representativeness of the analyzed RE discussions.
  Authors: We acknowledge that precise collection details are essential. The posts were obtained via the Stack Exchange Data Explorer using tag-based filters ('reverse-engineering' on SO and the dedicated site) combined with keyword matching in titles and bodies. In the revision we will list the exact SQL queries, the date range, minimum score/post-length thresholds, and the manual sampling procedure used to verify relevance. This will enable full replication of the 12k+ post set.
  Revision: yes
- Referee: [Methods] Methods (Labeling process): No inter-rater agreement statistics (e.g., Cohen's kappa or percentage agreement) are provided for the manual labeling of topics and assignment to the six high-level categories. Without this, the subjectivity concern in the post-hoc grouping cannot be assessed.
  Authors: The topic-to-category assignment was performed jointly by the author team with iterative discussion to reach consensus. We will add a dedicated paragraph reporting the agreement process: percentage agreement on the final groupings and, where multiple independent passes were feasible, Cohen's kappa. This will directly address concerns about subjectivity in the six-category taxonomy.
  Revision: yes
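For reference, the two agreement statistics named in the last exchange can be computed as in the sketch below. The category labels and ratings are made up for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up category assignments for six topics by two hypothetical raters.
rater_a = ["tools", "tools", "memory", "memory", "firmware", "firmware"]
rater_b = ["tools", "tools", "memory", "firmware", "firmware", "firmware"]

kappa = cohens_kappa(rater_a, rater_b)
percent_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
```

Kappa discounts the agreement expected by chance, so it is lower than raw percentage agreement whenever the raters disagree at all. Note that labels reached by joint consensus, as the authors describe, would need independent passes before either statistic is meaningful.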
Circularity Check
No circularity: empirical dataset construction and topic modeling are self-contained
full rationale
The paper collects external Stack Exchange posts, applies standard LDA with GA hyperparameter search, performs manual labeling, and computes difficulty metrics from community signals. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness claims appear in the provided text. The central claims rest on the collected data and topic outputs rather than reducing to inputs by definition. This is the expected outcome for a dataset paper using off-the-shelf methods.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of topics = 23
- LDA hyperparameters
axioms (2)
- domain assumption: Stack Exchange posts accurately reflect real-world RE practitioner challenges and knowledge gaps
- domain assumption: LDA with GA tuning yields semantically meaningful and coherent topics in technical Q&A text
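The second assumption could be probed with a coherence score. The sketch below computes a document-level NPMI for a topic's top words. The tiny corpus is invented, and counting co-occurrence per document (rather than within a sliding window) is a simplifying assumption, not the paper's procedure.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        """Fraction of documents containing all of the given words."""
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        elif p12 == 1:
            scores.append(0.0)   # NPMI is undefined here; treated as 0 in this sketch
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Invented tokenised documents (not taken from REStack).
documents = [
    ["gdb", "breakpoint", "debugger"],
    ["gdb", "breakpoint"],
    ["firmware", "uart"],
    ["firmware", "uart", "gdb"],
]

coherent = npmi_coherence(["gdb", "breakpoint"], documents)
mixed = npmi_coherence(["breakpoint", "uart"], documents)
```

Scores range from -1 (words never appear together) to 1 (words always appear together), so a topic whose top words score near or below zero would count against the coherence assumption.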