YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches
Pith reviewed 2026-05-13 23:04 UTC · model grok-4.3
The pith
YC Bench is a live benchmark that scores startup outperformance within YC batches using public signals available before Demo Day.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
YC Bench is a live benchmark for forecasting early outperformance within YC batches. It defines outperformance via the Pre-Demo Day Score, a KPI that combines publicly available traction signals and web visibility, and demonstrates the setup on the W26 batch, where a Google-mentions baseline recovers 6 of 11 top Demo Day performers (55 percent recall).
What carries the argument
The Pre-Demo Day Score, a KPI that aggregates publicly available traction signals and web visibility to rank startups before Demo Day.
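The paper does not publish the exact aggregation formula (the referee report below flags this). A minimal sketch of one plausible construction, assuming a weighted sum of z-scored public signals; the signal names and equal weights here are illustrative assumptions, not the authors' method:

```python
from statistics import mean, stdev

def pre_demo_day_score(startups, signals=("google_mentions", "web_visibility"),
                       weights=(0.5, 0.5)):
    """Rank a batch by a weighted sum of z-scored public signals.

    `signals` and `weights` are hypothetical; the paper states only that the
    score combines traction signals and web visibility.
    """
    # z-score each signal across the batch so units are comparable
    zscores = {}
    for s in signals:
        vals = [st[s] for st in startups]
        mu, sd = mean(vals), stdev(vals)
        zscores[s] = [(v - mu) / sd if sd else 0.0 for v in vals]
    # weighted combination, ranked highest-first
    ranked = [(st["name"], sum(w * zscores[s][i] for s, w in zip(signals, weights)))
              for i, st in enumerate(startups)]
    return sorted(ranked, key=lambda t: t[1], reverse=True)

batch = [
    {"name": "alpha", "google_mentions": 100, "web_visibility": 10},
    {"name": "beta",  "google_mentions": 10,  "web_visibility": 1},
    {"name": "gamma", "google_mentions": 50,  "web_visibility": 5},
]
ranking = pre_demo_day_score(batch)
```

The z-scoring step matters because raw mention counts and visibility metrics live on different scales; without normalization, one signal would dominate regardless of the chosen weights.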
If this is right
- Forecasting models can be evaluated and iterated on new YC batches every few months rather than years.
- Public data sources like Google mentions provide usable baselines that already recover more than half of top batch performers.
- Research on startup prediction can now use repeated, short-cycle experiments across successive YC batches.
Where Pith is reading between the lines
- If the score holds across batches, sequential training on multiple YC cohorts could steadily improve model accuracy.
- The batch structure might be adapted to create similar fast benchmarks for other accelerators or early-stage funding programs.
- Adding richer public signals beyond Google mentions could raise recall well above the 55 percent baseline shown.
Load-bearing premise
The Pre-Demo Day Score serves as a valid short-term proxy for meaningful outperformance that generalizes beyond the W26 batch.
What would settle it
Apply the Pre-Demo Day Score to the next YC batch and measure whether its top-ranked startups actually appear among the strongest performers at that batch's Demo Day, or track long-term outcomes such as funding and exits for the W26 top scorers.
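The recall measurement this test relies on reduces to a set intersection between a model's top-k list and the batch's actual top performers. A minimal sketch, with hypothetical startup identifiers mirroring the W26 headline figure:

```python
def recall_at_k(predicted_ranking, actual_top, k):
    """Fraction of the actual top performers that appear in the
    top-k of a model's ranking."""
    hits = set(predicted_ranking[:k]) & set(actual_top)
    return len(hits) / len(actual_top)

# Hypothetical example: a baseline whose top 11 contains 6 of the
# 11 Demo Day top performers scores 6/11, roughly 55% recall.
predicted = [f"s{i}" for i in range(20)]                       # baseline ranking
actual = [f"s{i}" for i in range(6)] + [f"t{i}" for i in range(5)]
recall = recall_at_k(predicted, actual, 11)
```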
Original abstract
Forecasting startup success is notoriously difficult, partly because meaningful outcomes, such as exits, large funding rounds, and sustained revenue growth, are rare and can take years to materialize. As a result, signals are sparse and evaluation cycles are slow. Y Combinator batches offer a unique mitigation: each batch comprises around 200 startups, funded simultaneously, with evaluation at Demo Day only three months later. We introduce YC Bench, a live benchmark for forecasting early outperformance within YC batches. Using the YC W26 batch as a case study (196 startups), we measure outperformance with a Pre-Demo Day Score, a KPI combining publicly available traction signals and web visibility. This short-term metric enables rapid evaluation of forecasting models. As a baseline, we take Google mentions prior to the YC W26 application deadline, a simple proxy for prior brand recognition, recovering 6 of 11 top performers at YC Demo Day (55% recall). YC Bench provides a live benchmark for studying startup success forecasting, with iteration cycles measured in months rather than years. Code and Data are available on GitHub: https://github.com/benstaf/ycbench
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces YC Bench, a live benchmark for forecasting early outperformance within Y Combinator batches. Using the W26 batch (196 startups) as a case study, it defines a Pre-Demo Day Score combining public traction signals and web visibility as a short-term KPI, releases associated code and data, and reports a simple baseline (pre-application Google mentions) that recovers 6 of 11 top performers at Demo Day (55% recall). The central contribution is the benchmark itself, which enables month-scale iteration cycles rather than multi-year waits for exits or large rounds.
Significance. If the benchmark and its short-term proxy hold, the work supplies a reproducible, public resource that lowers the barrier to developing and iterating on startup-success forecasting models. The explicit release of code, data, and an unfitted external baseline (Google mentions) is a concrete strength that supports community use and rapid evaluation.
major comments (1)
- [Baseline and Case Study] The manuscript does not specify the exact criteria used to identify the 11 'top performers' whose recall is reported for the Google-mentions baseline. Without this definition, it is impossible to assess selection effects, reproducibility of the 55% figure, or whether the Pre-Demo Day Score is being evaluated against a stable target.
minor comments (2)
- [Pre-Demo Day Score] The Pre-Demo Day Score is described as combining traction signals and web visibility, but the precise weighting or aggregation formula is not stated; an explicit equation or pseudocode would improve reproducibility.
- The GitHub repository link is provided but no commit hash or snapshot is given; pinning the exact version used for the W26 results would strengthen the reproducibility claim.
Simulated Author's Rebuttal
Thank you for the referee's thoughtful review and recommendation for minor revision. We are pleased that the significance of the benchmark is recognized. We address the major comment point by point below.
Point-by-point responses
Referee: [Baseline and Case Study] The manuscript does not specify the exact criteria used to identify the 11 'top performers' whose recall is reported for the Google-mentions baseline. Without this definition, it is impossible to assess selection effects, reproducibility of the 55% figure, or whether the Pre-Demo Day Score is being evaluated against a stable target.
Authors: We thank the referee for highlighting this omission. The 11 top performers are defined as the startups that rank highest according to the Pre-Demo Day Score within the W26 batch. This choice makes the evaluation self-consistent with the short-term KPI we propose, allowing for rapid iteration. To address the concern about selection effects and reproducibility, we will explicitly state this definition in the revised manuscript, along with the precise formula for the Pre-Demo Day Score (a combination of public traction signals and web visibility). We believe this will make the 55% recall figure fully reproducible. Regarding the stability of the target, the Pre-Demo Day Score serves as the target by design, as it captures early outperformance observable at the Demo Day time scale.
Revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's central contribution is the release of YC Bench, a live benchmark defined explicitly via the Pre-Demo Day Score (a transparent combination of public traction signals and web visibility) for the W26 batch, together with an unfitted external baseline of pre-application Google mentions. No equations, predictions, or claims reduce by construction to fitted inputs or self-citations; the metric is presented as a short-term proxy without hidden assumptions about long-term generalization, and the work is self-contained against external benchmarks with released code and data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Public traction signals and web visibility are sufficient to construct a meaningful short-term outperformance metric.