TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time
Pith reviewed 2026-05-11 01:14 UTC · model grok-4.3
The pith
TubeCensus builds a historical record of YouTube channels and subscriber counts by linking two decades of archived page captures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TubeCensus organizes nearly twenty years of YouTube page captures from the Internet Archive into a longitudinal dataset of channels and subscriber counts. The construction is transparent and replicable, requires no interaction with the official YouTube API, and covers creators responsible for at least 30-36 percent of platform content, with good coverage of prominent creators.
What carries the argument
The collection, linking, and organization of Internet Archive captures of YouTube pages into historical channel and subscriber records.
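To make the collection step concrete, the following is a minimal sketch of how captures of a single YouTube page could be enumerated through the Wayback Machine CDX interface. It is not the authors' pipeline: the helper names (`build_cdx_query`, `parse_cdx_rows`, `capture_url`) and the example page URL are illustrative assumptions, and the network request itself is left out so the parsing logic can be checked offline.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, start_year=None, end_year=None):
    """Build a CDX query URL listing successful captures of one page.

    Hypothetical helper for illustration; not part of TubeCensus.
    """
    params = {"url": page_url, "output": "json", "filter": "statuscode:200"}
    if start_year is not None:
        params["from"] = str(start_year)
    if end_year is not None:
        params["to"] = str(end_year)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload):
    """Convert CDX JSON text (first row is the header) into dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def capture_url(row):
    """Return the replay URL for one parsed capture row."""
    return f"https://web.archive.org/web/{row['timestamp']}/{row['original']}"
```

Each parsed row carries a capture timestamp and the original URL, which is the minimum needed to order captures of the same page in time before any linking across identifiers is attempted.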
If this is right
- Researchers gain access to time-series subscriber data for studying how creator audiences evolve in response to platform algorithm updates.
- The same public archive sources can be used by others to replicate or extend the census without depending on changing API outputs.
- Initial analysis of channel content types and growth mechanisms becomes possible at a scale that covers a meaningful portion of the platform.
- The pip package allows direct use of the cleaned dataset while hiding the details of YouTube identifiers and capture linking.
Where Pith is reading between the lines
- The same archive-linking method could be adapted to build comparable historical datasets for other platforms that have been regularly captured by web archives.
- Combining TubeCensus subscriber histories with separate video metadata could support tests of whether specific content formats drive sustained audience growth.
- Periodic updates to the dataset would let researchers track ongoing changes in the creator landscape as new channels appear and older ones evolve.
Load-bearing premise
The Internet Archive captures of YouTube pages are complete enough and can be accurately linked to specific channels across different times without major gaps or identifier errors.
What would settle it
Finding that a substantial share of high-view or high-subscriber channels identified in independent sources are missing from TubeCensus or show mismatched subscriber histories would challenge the coverage and accuracy claims.
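The check described above can be sketched as a simple set comparison. This is an illustration only: `coverage_report` is a hypothetical helper, and both inputs are assumed to be collections of channel identifiers that have already been normalized to the same form.

```python
def coverage_report(census_ids, reference_ids):
    """Compare an independent reference list of channels against the census.

    Returns the share of reference channels found in the census and the
    sorted list of those that are missing. Hypothetical helper.
    """
    census = set(census_ids)
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference list is empty")
    missing = sorted(reference - census)
    covered = len(reference) - len(missing)
    return {"covered_share": covered / len(reference), "missing": missing}
```

A low `covered_share` on a list of high-subscriber channels would be direct evidence against the coverage claim; checking subscriber histories for mismatches would need a separate comparison of the time series themselves.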
Original abstract
YouTube is central to contemporary mass media. However, the official YouTube API does not provide access to the full set of creators or creator metadata on the platform. This lack of basic visibility into the YouTube ecosystem hinders understanding of the platform's creator economy. Researchers currently have no easy, transparent, or replicable way to construct large-scale datasets of YouTube creators and their audiences over time. This makes it challenging to study vital social questions, such as how changes to the YouTube recommendation algorithm shape creator incentives and by extension the mass media on the platform. We address this gap with TubeCensus, a large-scale longitudinal dataset of YouTube creators and subscriber counts, constructed by collecting, linking, and organizing nearly two decades of YouTube page captures from the Internet Archive. This approach is transparent and replicable and does not require interaction with the YouTube API, whose output can change over time. We validate the coverage of TubeCensus against prior estimates of YouTube's size and find that our resource includes creators responsible for at least 30-36% of all YouTube content. We also find that TubeCensus provides good coverage of prominent creators. To support future research, we hide the substantial complexities of the YouTube identifier system and Internet Archive capture system by distributing our dataset via an easy-to-use pip package. Finally, we use our resource to complete basic exploratory analysis of YouTube channel content and the mechanisms associated with YouTube channel growth.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TubeCensus, a longitudinal dataset of YouTube channels and subscriber counts constructed by collecting, linking, and organizing nearly two decades of YouTube page captures from the Internet Archive. It claims to cover creators responsible for at least 30-36% of all YouTube content with good coverage of prominent creators, distributes the resource via an easy-to-use pip package that abstracts away identifier and capture complexities, and includes basic exploratory analysis of channel content and growth mechanisms.
Significance. If the coverage and linking claims hold, TubeCensus would be a valuable public resource for social science research on the YouTube creator economy, enabling replicable studies of platform dynamics without dependence on the official API. The explicit strengths are the transparent, API-free construction from public captures, the pip package for accessibility, and the focus on longitudinal subscriber data; these directly address the stated gap in visibility into creator incentives and mass media on the platform.
major comments (1)
- [Abstract] Abstract and validation description: the central 30-36% coverage claim (and the 'good coverage of prominent creators' statement) is load-bearing for the paper's contribution, yet the provided text does not describe the validation method, name the prior size estimates it relies on, explain how channel identifiers are matched across snapshots, report precision/recall, or give a sensitivity analysis for Internet Archive (IA) incompleteness and identifier changes (usernames to UC... IDs). This prevents assessment of whether the percentage is robust or affected by systematic gaps.
minor comments (3)
- The manuscript should include a dedicated methods section (or subsection) detailing the linking algorithm, deduplication rules, and any exclusion criteria for captures or channels to support replicability claims.
- Clarify in the exploratory analysis section how subscriber counts are aggregated or interpolated across irregular IA snapshot dates, and whether any temporal alignment or normalization is applied.
- The pip package description would benefit from a short usage example or API reference in the main text or appendix to demonstrate how complexities are hidden for end users.
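The second minor comment, on subscriber counts observed at irregular snapshot dates, can be illustrated with a short sketch. Linear interpolation between the two nearest captures is used here purely as an assumption for illustration; the manuscript text reviewed does not state which alignment method, if any, the authors apply, and `interpolate_subscribers` is a hypothetical name.

```python
from bisect import bisect_left
from datetime import date


def interpolate_subscribers(observations, target):
    """Estimate a subscriber count on `target` from irregular observations.

    `observations` is a list of (date, count) pairs. Returns None when
    `target` falls outside the observed range, so no extrapolation is done.
    """
    observations = sorted(observations)
    dates = [d for d, _ in observations]
    i = bisect_left(dates, target)
    # Exact match: return the observed value directly.
    if i < len(dates) and dates[i] == target:
        return float(observations[i][1])
    # Before the first or after the last capture: refuse to extrapolate.
    if i == 0 or i == len(dates):
        return None
    (d0, c0), (d1, c1) = observations[i - 1], observations[i]
    frac = (target - d0).days / (d1 - d0).days
    return c0 + frac * (c1 - c0)
```

Returning `None` outside the observed range keeps gaps in the archive visible instead of silently filling them, which matters when capture frequency differs sharply between channels.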
Simulated Author's Rebuttal
We thank the referee for their careful review and for identifying the need for greater transparency around our coverage validation. We address the single major comment below and will incorporate the requested details into the revised manuscript.
- Referee: [Abstract] Abstract and validation description: the central 30-36% coverage claim (and the 'good coverage of prominent creators' statement) is load-bearing for the paper's contribution, yet the provided text does not describe the validation method, name the prior size estimates it relies on, explain how channel identifiers are matched across snapshots, report precision/recall, or give a sensitivity analysis for Internet Archive (IA) incompleteness and identifier changes (usernames to UC... IDs). This prevents assessment of whether the percentage is robust or affected by systematic gaps.
Authors: We agree that the abstract is too concise on validation and that the manuscript should make the supporting methods, estimates, and robustness checks explicit. The full paper contains a dedicated Validation section that compares TubeCensus to prior published estimates of total YouTube channels and content volume; we will add a one-sentence summary of those references and the resulting 30-36% range directly into the abstract. The channel-linking procedure resolves usernames to UC... IDs across snapshots and uses secondary metadata (titles, descriptions, and upload counts) to handle identifier changes; we will insert a brief description of this multi-identifier matching into both the abstract and the Validation section. We will also add (i) precision/recall figures obtained from manual annotation of a random sample of channels and (ii) a sensitivity analysis that varies the number of Internet Archive captures retained and reports the resulting coverage bounds. These additions will be included in the revised manuscript.
Revision: yes
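The identifier problem the rebuttal refers to can be sketched as follows. This is not the authors' linking procedure: it only shows how the common public channel URL forms (`/channel/UC...`, `/user/NAME`, `/c/NAME`, `/@handle`) might be classified, and how an alias could be mapped to a UC... ID once that ID has been read from the captured page. The function names, the lowercasing of aliases, and the first-seen-wins rule are all assumptions, and the secondary-metadata matching mentioned in the rebuttal is not modeled.

```python
import re
from urllib.parse import urlparse

# Channel IDs are "UC" followed by 22 URL-safe base64 characters.
CHANNEL_ID = re.compile(r"^UC[0-9A-Za-z_-]{22}$")


def classify_channel_url(url):
    """Return (kind, value) for a channel URL, or None if unrecognized.

    Expects a full URL with a scheme. Aliases are lowercased, which is an
    assumption made for this sketch.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        return None
    head = parts[0]
    if head == "channel" and len(parts) > 1 and CHANNEL_ID.match(parts[1]):
        return ("channel_id", parts[1])
    if head == "user" and len(parts) > 1:
        return ("legacy_username", parts[1].lower())
    if head == "c" and len(parts) > 1:
        return ("custom_url", parts[1].lower())
    if head.startswith("@") and len(head) > 1:
        return ("handle", head[1:].lower())
    return None


def link_aliases(observations):
    """Map each alias to the first channel ID seen on a page at that alias.

    `observations` is an iterable of (page_url, channel_id_on_page) pairs.
    """
    alias_to_id = {}
    for url, channel_id in observations:
        key = classify_channel_url(url)
        if key is None or key[0] == "channel_id":
            continue
        alias_to_id.setdefault(key, channel_id)
    return alias_to_id
```

A first-seen-wins map cannot detect an alias that was later reassigned to a different channel, which is exactly the kind of error the requested precision/recall figures would need to quantify.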
Circularity Check
No circularity: data aggregation from external captures with external validation
The paper describes construction of TubeCensus via collection, linking, and organization of Internet Archive YouTube page captures, followed by empirical validation of coverage against prior independent estimates of YouTube's size. No mathematical derivations, fitted parameters, predictions, or self-citations appear in the provided text that reduce any central claim to its own inputs by construction. The 30-36% coverage figure is presented as a direct measurement result rather than a self-referential output, and the methodology is framed as transparent and replicable using public external data. This is a standard data-resource paper with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Internet Archive captures provide sufficient coverage, and can be linked accurately enough, to recover YouTube channel identifiers and subscriber counts over time.