pith. machine review for the scientific record.

arxiv: 2605.02974 · v1 · submitted 2026-05-03 · 💱 q-fin.PR · cs.LG


PHBench: A Benchmark for Predicting Startup Series A Funding from Product Hunt Launch Signals

Ben Griffin, Rick Chen, Yagiz Ihlamur

Pith reviewed 2026-05-08 18:40 UTC · model grok-4.3

classification 💱 q-fin.PR cs.LG
keywords product hunt · series a funding · startup prediction · benchmark dataset · ensemble model · llm evaluation · crunchbase matching · early stage signals

The pith

Product Hunt launch signals contain statistically significant information for predicting Series A funding within 18 months.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds PHBench by linking 67,292 Product Hunt posts from 2019-2025 to Crunchbase records through domain matching, yielding 528 verified Series A raises at a 0.78 percent base rate. An ensemble of models selected on validation F0.5 achieves F0.5 of 0.097 and average precision of 0.037 on the blind test set of 103 positives, delivering a 4.7 times lift over random and a statistically confirmed edge over logistic regression. Zero-shot Gemini models perform at or below the baseline, with the strongest variant performing worst, while both ML and LLM results track the post-2021 funding contraction. The work supplies public train-validation-test splits, 61 features, and a leaderboard to support reproducible evaluation of early-stage prediction methods.

Core claim

Structured launch signals on Product Hunt contain statistically significant predictive information for Series A funding outcomes. We construct PHBench from 67,292 featured Product Hunt posts spanning 2019-2025, linked to Crunchbase funding records via deterministic domain matching, identifying 528 verified Series A raises within 18 months of launch (positive rate: 0.78 percent). Our best-performing model, a three-component ensemble, achieves F0.5 = 0.097 and AP = 0.037 on the private held-out test set, with a paired bootstrap confirming an advantage over the logistic regression baseline. Both ML and LLM models exhibit the same temporal performance decay that tracks the 2020-2021 funding boom and subsequent contraction, confirming the dataset captures genuine market structure rather than noise.
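For readers parsing the headline numbers: F0.5 weights precision four times as heavily as recall, average precision (AP) summarizes the precision-recall curve, and a random ranker's AP converges to the base rate, which is where the roughly 0.008 random reference and the 4.7x lift (0.037 / 0.0078) come from. A minimal sketch of how these quantities are computed with scikit-learn, on synthetic labels rather than PHBench data:

```python
import numpy as np
from sklearn.metrics import average_precision_score, fbeta_score

rng = np.random.default_rng(0)
n = 13_000                                      # illustrative test-split size
y_true = (rng.random(n) < 0.0078).astype(int)   # ~0.78% positive base rate
y_score = rng.random(n)                         # placeholder scores, not a real model
y_pred = (y_score > 0.99).astype(int)           # hard predictions at one threshold

f05 = fbeta_score(y_true, y_pred, beta=0.5, zero_division=0)  # precision-weighted
ap = average_precision_score(y_true, y_score)   # area under precision-recall curve
lift = ap / y_true.mean()                       # random scores give AP ~ base rate, lift ~ 1x
print(f"F0.5={f05:.3f}  AP={ap:.4f}  lift={lift:.1f}x")
```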

What carries the argument

PHBench dataset of Product Hunt posts matched to Crunchbase records via deterministic domain matching, together with 61 engineered features and the five-metric evaluation harness that selects the three-component ensemble by validation F0.5.
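Concretely, selecting "by validation F0.5" is a best-of-N search over configurations on the validation split, with the winner frozen before test evaluation. A minimal sketch of that loop (the candidate set and threshold grid here are assumptions; the paper's search covers 144 configurations):

```python
from sklearn.metrics import fbeta_score

def select_by_validation_f05(candidates, X_val, y_val):
    """Pick the (model, threshold) pair with the best validation F0.5.

    candidates: dict mapping a name to a fitted model exposing predict_proba.
    A sketch of best-of-N selection, not the paper's exact harness.
    """
    best = None
    for name, model in candidates.items():
        scores = model.predict_proba(X_val)[:, 1]
        for threshold in (0.1, 0.2, 0.3, 0.4, 0.5):   # illustrative grid
            preds = (scores >= threshold).astype(int)
            f05 = fbeta_score(y_val, preds, beta=0.5, zero_division=0)
            if best is None or f05 > best[0]:
                best = (f05, name, threshold)
    return best   # frozen here; the blind test set is only scored afterwards
```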

If this is right

  • Machine learning models can extract usable early signals for Series A prediction from public launch data alone.
  • Zero-shot large language models do not exceed traditional ensembles on this task, and the most capable variant performs worst.
  • Model performance declines after the 2021 funding peak in the same pattern as actual market conditions.
  • The released splits, features, and leaderboard allow direct comparison of new predictors against the reported baselines.
  • Temporal tracking confirms the signals reflect genuine economic structure rather than dataset artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Screening tools for early investors could incorporate Product Hunt activity as one low-cost filter among other signals.
  • Extending the benchmark to later funding rounds or acquisition outcomes would test whether the same features generalize.
  • Fine-tuning or better prompting might close the gap for language models on numerical launch metrics.
  • Merging Product Hunt features with patent, web-traffic, or team-background data could raise precision beyond the current 0.037 AP.

Load-bearing premise

Deterministic domain matching between Product Hunt posts and Crunchbase records correctly identifies the same companies and funding events without substantial false positives or missed links.
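Deterministic domain matching in this sense typically reduces to a normalization step followed by an exact join. The paper does not spell out its normalization rules, so the field names and rules below are assumptions; the shape of the procedure is a sketch like:

```python
from urllib.parse import urlparse

def normalize_domain(url: str) -> str:
    """Reduce a URL to a bare host for exact matching.
    Lowercasing and stripping the scheme and 'www.' are assumed rules,
    not taken from the paper."""
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    return host.removeprefix("www.")

def match_posts_to_funding(posts, crunchbase):
    """Exact-join Product Hunt posts to Crunchbase records on domain.
    The 'website' / 'homepage_url' field names are illustrative."""
    by_domain = {normalize_domain(r["homepage_url"]): r for r in crunchbase}
    return [
        (post, by_domain[normalize_domain(post["website"])])
        for post in posts
        if normalize_domain(post["website"]) in by_domain
    ]
```

The manual audit proposed below under "What would settle it" would then be a random sample of the pairs this join returns, checked by hand for company identity and funding-event correctness.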

What would settle it

A manual review of several hundred matched pairs that finds frequent incorrect company linkages, or a replication study on a fresh post-2025 cohort where the reported AP falls to the random baseline of roughly 0.008.

original abstract

Structured launch signals on Product Hunt contain statistically significant predictive information for Series A funding outcomes. We construct PHBench from 67,292 featured Product Hunt posts spanning 2019-2025, linked to Crunchbase funding records via deterministic domain matching, identifying 528 verified Series A raises within 18 months of launch (positive rate: 0.78%). Our best-performing model, a three-component ensemble (ENS_avg, ENS_ISO, XGB) selected by validation F0.5, achieves F0.5 = 0.097 and AP = 0.037 (95% CI: 0.024-0.072; 4.7x lift over random) on the private held-out test set (103 positives). A paired bootstrap confirms a statistically credible advantage over the logistic regression baseline (AP delta: +0.013, 95% CI: [0.004, 0.039], p < 0.001; F0.5 delta: +0.056, 95% CI: [0.006, 0.122], p = 0.016). Validation-set metrics (F0.5 = 0.284, AP = 0.126) reflect best-of-144 selection bias on 53 positives and are reported for benchmark reproducibility only. We further evaluate three zero-shot Gemini models (Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 3.1 Pro) in an anonymized numerical setting. The best LLM achieves AP = 0.034 (Gemini 3 Flash), below the LR baseline AP of 0.044. Notably, the most capable Gemini variant (Gemini 3.1 Pro, AP = 0.023) performs worst -- an unexpected pattern that warrants further investigation across providers and prompting strategies. Both ML and LLM models show the same temporal performance decay tracking the 2020-2021 funding boom and subsequent contraction, confirming the dataset captures genuine market structure rather than noise. PHBench provides a reproducible framework comprising public training, validation, and blind test splits; 61 engineered features; a five-metric evaluation harness; and a public leaderboard at https://phbench.com. All code, baseline models, and anonymized dataset splits are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces PHBench, a benchmark dataset of 67,292 Product Hunt launch posts (2019-2025) linked to Crunchbase via deterministic domain matching, yielding 528 Series A positives (0.78% rate) within 18 months. It reports that a three-component ensemble (selected via validation F0.5) achieves F0.5=0.097 and AP=0.037 on a private held-out test set (103 positives), with 4.7x lift over random and bootstrap-confirmed gains over logistic regression baseline; zero-shot Gemini LLMs underperform the baseline; temporal decay in performance tracks funding cycles; and all code, splits, and a public leaderboard are released.

Significance. If the linkage procedure is accurate, the work supplies a timely, fully reproducible public benchmark for early-stage funding prediction from launch-platform signals, with notable strengths in bootstrap significance testing, explicit flagging of validation-set selection bias, observation of market-consistent temporal decay, and open release of anonymized data, 61 features, and evaluation harness. This could support follow-on research in startup analytics and predictive modeling at the intersection of product launch data and venture outcomes.

major comments (2)
  1. [Abstract / data linkage] Abstract and data-construction description: the positive class (528 Series A raises) is defined entirely by deterministic domain matching between Product Hunt posts and Crunchbase records, yet no precision/recall estimates, manual audit sample, or sensitivity analysis for matching errors is provided. With a 0.78% base rate, even modest false-positive rates (e.g., name/domain collisions) or false-negative rates (re-branding, subsidiaries, Crunchbase gaps) would directly alter the label distribution, AP, F0.5, lift, and bootstrap p-values that underpin the central claim of statistically detectable predictive signal.
  2. [Evaluation and ensemble construction] Model selection paragraph: the ensemble (ENS_avg, ENS_ISO, XGB) is chosen by best-of-144 validation F0.5 search; although the paper correctly notes the resulting upward bias in validation metrics, the test-set gains (AP delta +0.013, F0.5 delta +0.056) should be accompanied by an explicit statement that the final ensemble weights and component selection were frozen before test-set evaluation and that no further hyper-parameter tuning occurred on test data.
minor comments (3)
  1. [Abstract] Abstract: the phrase 'verified Series A raises' is used without clarifying that verification is limited to deterministic matching; a brief parenthetical on the matching rule would improve precision.
  2. [LLM experiments] LLM evaluation: the inverse scaling result (Gemini 3.1 Pro worst at AP=0.023) is noted as unexpected; adding a short discussion of prompt variations or numerical anonymization details would help readers interpret whether this reflects model capability or experimental setup (a sketch of such an anonymized setting follows this list).
  3. [Feature set] Feature engineering: the 61 engineered features are referenced but not enumerated or categorized (e.g., engagement, timing, textual); a compact table or appendix listing them would aid reproducibility.
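On the anonymization question raised in minor comment 2: the paper evaluates the Gemini models on numbers alone, with identifying text stripped, but its exact prompt is not reproduced in this review. The feature names and wording below are therefore purely illustrative of what such an anonymized numerical setting might look like:

```python
def build_anonymized_prompt(features: dict) -> str:
    """Render launch metrics as an anonymized prompt for a zero-shot LLM.
    Feature names are hypothetical, not the paper's actual 61 features."""
    lines = "\n".join(f"- {name}: {value}" for name, value in features.items())
    return (
        "A product launched on Product Hunt has these anonymized metrics:\n"
        f"{lines}\n"
        "Estimate the probability (0 to 1) that the startup raises a "
        "Series A within 18 months of launch."
    )

print(build_anonymized_prompt(
    {"upvotes_day_1": 412, "comments": 58, "maker_followers": 1200}
))
```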

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract / data linkage] Abstract and data-construction description: the positive class (528 Series A raises) is defined entirely by deterministic domain matching between Product Hunt posts and Crunchbase records, yet no precision/recall estimates, manual audit sample, or sensitivity analysis for matching errors is provided. With a 0.78% base rate, even modest false-positive rates (e.g., name/domain collisions) or false-negative rates (re-branding, subsidiaries, Crunchbase gaps) would directly alter the label distribution, AP, F0.5, lift, and bootstrap p-values that underpin the central claim of statistically detectable predictive signal.

    Authors: We agree that the accuracy of the deterministic domain matching is a critical detail given the low base rate. Our procedure uses exact domain matching to reduce name-collision false positives, but we acknowledge that re-branding, subsidiaries, and Crunchbase coverage gaps could introduce false negatives. The original submission did not include precision/recall estimates or a manual audit. In the revised manuscript we will add a dedicated paragraph in the Data Construction section that (i) explicitly discusses the expected direction and magnitude of matching errors, (ii) reports a sensitivity analysis performed by relaxing the domain-matching criteria on a random subsample of 1,000 posts, and (iii) quantifies the potential impact of such errors on the reported AP, F0.5, and bootstrap confidence intervals. A full manual audit of all 67k posts remains outside the scope of the current work, but the added discussion will make the limitations transparent. revision: partial

  2. Referee: [Evaluation and ensemble construction] Model selection paragraph: the ensemble (ENS_avg, ENS_ISO, XGB) is chosen by best-of-144 validation F0.5 search; although the paper correctly notes the resulting upward bias in validation metrics, the test-set gains (AP delta +0.013, F0.5 delta +0.056) should be accompanied by an explicit statement that the final ensemble weights and component selection were frozen before test-set evaluation and that no further hyper-parameter tuning occurred on test data.

    Authors: We thank the referee for this precise observation. The manuscript already flags the upward bias in validation metrics due to best-of-144 selection, but we agree that an explicit statement confirming the test set remained untouched is necessary. All component selection, weight optimization, and hyper-parameter choices were performed exclusively on the validation set; the final ensemble was frozen before any test-set evaluation, and no further tuning or adjustments were made using test data. In the revised manuscript we will insert a clear sentence in both the Model Selection and Evaluation sections stating that the test set was held completely blind throughout model development and selection. revision: yes
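For concreteness, the paired bootstrap both the abstract and this response lean on resamples the same test rows for both models, so each resampled AP delta is computed on matched data. A minimal sketch (the resampling count and percentile CI convention are assumptions):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def paired_bootstrap_ap_delta(y, scores_a, scores_b, n_boot=10_000, seed=0):
    """95% CI for AP(model A) - AP(model B) via paired resampling.
    y: np.ndarray of 0/1 labels; scores_*: np.ndarray of model scores.
    A sketch of the idea, not the paper's exact procedure."""
    rng = np.random.default_rng(seed)
    n, deltas = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)        # same resampled rows for both models
        if y[idx].sum() == 0:              # AP is undefined without positives
            continue
        deltas.append(average_precision_score(y[idx], scores_a[idx])
                      - average_precision_score(y[idx], scores_b[idx]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(np.mean(deltas)), (float(lo), float(hi))
```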

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper's central derivation constructs labels via deterministic domain matching between Product Hunt posts and external Crunchbase records, engineers 61 features from launch signals, trains models on public training/validation splits, and reports standard metrics (AP, F0.5) on a private held-out test set. No equations or results reduce to inputs by construction; reported performance is not a tautological fit or a renaming of the matching process itself. No self-citations are load-bearing, no uniqueness theorems are invoked, and no ansatzes are smuggled. The temporal decay pattern and public reproducibility further confirm the chain is validated against external data sources rather than internally forced.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Central claim rests on accurate cross-source entity resolution and the assumption that the observed 0.78% positive rate and feature distributions generalize to the target prediction task.

free parameters (2)
  • Ensemble component selection by validation F0.5
    Choice of which models to average and the selection metric itself are tuned on the validation set containing only 53 positives.
  • 61 engineered features
    Feature definitions and transformations involve design choices whose exact impact on performance is not detailed in the abstract.
axioms (1)
  • domain assumption Deterministic domain matching between Product Hunt and Crunchbase records produces accurate company and funding linkages
    This matching step directly determines the 528 positive labels used for training and evaluation.

pith-pipeline@v0.9.0 · 5742 in / 1526 out tokens · 33114 ms · 2026-05-08T18:40:05.954106+00:00 · methodology

