Linked Crunchbase: A Linked Data API and RDF Data Set About Innovative Companies

Michael Färber

arXiv: 1907.08671 · v1 · pith:U4ERLZ7Wnew · submitted 2019-07-19 · 💻 cs.DB · cs.AI · cs.IR

Pith reviewed 2026-05-24 18:42 UTC · model grok-4.3

classification 💻 cs.DB · cs.AI · cs.IR

keywords Linked Data · RDF · Knowledge Graph · Crunchbase · API · Startups · Data Integration

The pith

Crunchbase data on companies, people and investments has been converted to an RDF knowledge graph of over 347 million triples and exposed through a Linked Data API.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn the proprietary Crunchbase platform into a public Web of Data resource by building a Linked Data API that serves the original records in RDF. It details the mapping process, the addition of sameAs links to other sources, and the subsequent crawl that assembles a full knowledge graph. A sympathetic reader would see this as a concrete way to make unique startup and investment data queryable with standard Semantic Web tools instead of remaining locked inside one website. The published dataset contains 781k people, 659k organizations and 343k investments described in 347 million triples.

Core claim

We developed and hosted a Linked Data API for Crunchbase and integrated sameAs links to other data sources. We then crawled RDF data based on this API to build a custom Crunchbase RDF knowledge graph. We created an RDF data set with over 347 million triples, including 781k people, 659k organizations, and 343k investments. Our Crunchbase Linked Data API is available online at http://linked-crunchbase.org.

What carries the argument

The Linked Data API for Crunchbase, which both exposes the data as RDF and acts as the entry point for crawling the full knowledge graph.

If this is right

The data becomes usable by anyone on the Web in machine-readable RDF format.
sameAs links allow the dataset to be integrated with other linked open data sources.
Standard SPARQL queries can now be run directly against the Crunchbase content.
The knowledge graph can be kept current by re-crawling the API.
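
The sameAs point can be made concrete with a small sketch. Because owl:sameAs is symmetric and transitive, a consumer can group all URIs that denote the same entity before joining data across sources. The code below does this with a union-find structure. It is an illustration only: the `ex:` identifiers are hypothetical and the function is not taken from the paper.

```python
from collections import defaultdict


def same_as_clusters(pairs):
    """Group URIs connected by owl:sameAs links into equivalence classes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    # Sort members and clusters so the result is deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical identifiers, for illustration only.
links = [
    ("ex:crunchbase/acme", "ex:dbpedia/Acme"),
    ("ex:dbpedia/Acme", "ex:wikidata/Q0"),
    ("ex:crunchbase/globex", "ex:dbpedia/Globex"),
]
clusters = same_as_clusters(links)
print(clusters)
```

Each resulting cluster can then be treated as one entity when merging statements from different data sets.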

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other closed data platforms could follow the same API-plus-crawl pattern to increase their reach.
The resulting graph could support large-scale studies of investment networks that combine Crunchbase with public financial or patent data.
Downstream applications might treat the triples as a live, queryable index rather than a static dump.

Load-bearing premise

Crunchbase's internal data model can be mapped to RDF without significant loss of meaning or accuracy, and the resulting triples faithfully represent the original records.
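
To show what such a mapping involves, here is a minimal sketch that turns one JSON-like record into N-Triples lines. Everything specific in it is an assumption made for illustration: the `http://example.org/cb/` namespace, the record layout, and the field names are hypothetical and do not reproduce the paper's actual vocabulary or mapping rules.

```python
BASE = "http://example.org/cb/"  # hypothetical namespace, not the real one
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value):
    """Serialize a value as a plain N-Triples string literal."""
    text = str(value)
    # Escape the backslash first so later escapes are not doubled.
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return f'"{text}"'


def record_to_ntriples(record):
    """Map one record to a list of N-Triples lines (one triple per line)."""
    subject = f"<{BASE}{record['type'].lower()}/{record['id']}>"
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}ontology#{record['type']}> ."]
    for field, value in sorted(record["properties"].items()):
        if value is None:
            continue  # skip missing values rather than emit empty literals
        lines.append(f"{subject} <{BASE}ontology#{field}> {literal(value)} .")
    return lines


# Hypothetical record, for illustration only.
record = {
    "type": "Organization",
    "id": "acme",
    "properties": {
        "name": 'Acme "Rockets" Inc.',
        "founded_on": "2010-01-01",
        "closed_on": None,
    },
}
lines = record_to_ntriples(record)
print("\n".join(lines))
```

Even in this toy version the places where meaning can be lost are visible: every value becomes an untyped string, and a missing field is indistinguishable from a field that was never mapped.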

What would settle it

A sample-by-sample comparison between the generated RDF triples and the original Crunchbase records that would reveal any systematic loss or distortion in the mapping.
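
One way such a comparison could be set up is sketched below. For a sampled record it checks that every non-empty field reappears as a triple with exactly the same value and reports the rest. The record fields, the `ex:` predicates, and the `predicate_for` mapping are hypothetical; a real audit would use the paper's actual mapping.

```python
def audit_record(record, triples, subject, predicate_for):
    """Return (field, expected, found) for every field not reproduced exactly."""
    found = {}
    for s, p, o in triples:
        if s == subject:
            found.setdefault(p, set()).add(o)

    problems = []
    for field, expected in record.items():
        if expected is None:
            continue  # nothing to verify for empty source fields
        values = found.get(predicate_for(field), set())
        if str(expected) not in values:
            problems.append((field, str(expected), sorted(values)))
    return problems


# Hypothetical sample, for illustration only.
subject = "ex:org/acme"
record = {"name": "Acme", "founded_on": "2010-01-01", "num_employees": 250}
triples = [
    (subject, "ex:name", "Acme"),
    (subject, "ex:founded_on", "2010"),  # precision lost in the conversion
]
problems = audit_record(record, triples, subject, lambda field: f"ex:{field}")
print(problems)
```

Run over a random sample of records, the share of records with a non-empty problem list gives a simple estimate of how often the mapping drops or distorts information.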

Figures

Figures reproduced from arXiv: 1907.08671 by Michael Färber.

**Figure 2.** UML sequence diagram illustrating the use of the wrapper. The wrapper …

**Figure 3.** Subset of the classes and object properties of the Crunchbase schema.

**Figure 4.** Subgraph of our Crunchbase knowledge graph showing information about a company.

| Entity type | # Instances |
| --- | ---: |
| News | 5,845,188 |
| Jobs | 3,611,335 |
| Websites | 2,282,952 |
| People | 780,727 |
| Organizations | 658,963 |
| Addresses | 447,705 |
| Investments | 342,547 |
| Degrees | 276,653 |
| Funding Rounds | 222,244 |
| Acquisitions | 77,105 |
| IPOs | 16,037 |
| Locations | 12,211 |
| Funds | 9,349 |
| Categories | 739 |

**Figure 5.** Statistics of our obtained Crunchbase RDF data set.
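
The per-type instance counts reported with the figures above can be checked against the headline numbers with a little arithmetic. The sketch below rounds the relevant counts to thousands and relates the stated lower bound of 347 million triples to the number of listed instances. Treating 347,000,000 as the triple count is an assumption, since the paper only says "over 347 million".

```python
# Instance counts per entity type, copied from the statistics listed with the figures.
INSTANCES = {
    "News": 5_845_188,
    "Jobs": 3_611_335,
    "Websites": 2_282_952,
    "People": 780_727,
    "Organizations": 658_963,
    "Addresses": 447_705,
    "Investments": 342_547,
    "Degrees": 276_653,
    "Funding Rounds": 222_244,
    "Acquisitions": 77_105,
    "IPOs": 16_037,
    "Locations": 12_211,
    "Funds": 9_349,
    "Categories": 739,
}
TRIPLES_LOWER_BOUND = 347_000_000  # assumption: "over 347 million triples"


def to_thousands(n):
    """Round a count to the nearest thousand, e.g. 780_727 -> '781k'."""
    return f"{round(n / 1000)}k"


headline = {
    name: to_thousands(INSTANCES[name])
    for name in ("People", "Organizations", "Investments")
}
total_instances = sum(INSTANCES.values())
triples_per_instance = TRIPLES_LOWER_BOUND / total_instances

print(headline)
print(total_instances, round(triples_per_instance, 1))
```

Under these assumptions the rounded values match the 781k people, 659k organizations and 343k investments quoted in the abstract, and the data set holds roughly 24 triples per listed instance. The ratio is only indicative, because the listed entity types may not cover every subject in the graph.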

Original abstract

Crunchbase is an online platform collecting information about startups and technology companies, including attributes and relations of companies, people, and investments. Data contained in Crunchbase is, to a large extent, not available elsewhere, making Crunchbase a unique data source. In this paper, we present how to bring Crunchbase to the Web of Data so that its data can be used in the machine-readable RDF format by anyone on the Web. First, we give insights into how we developed and hosted a Linked Data API for Crunchbase and how sameAs links to other data sources are integrated. Then, we present our method for crawling RDF data based on this API to build a custom Crunchbase RDF knowledge graph. We created an RDF data set with over 347 million triples, including 781k people, 659k organizations, and 343k investments. Our Crunchbase Linked Data API is available online at http://linked-crunchbase.org.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper releases a public RDF dump and API for Crunchbase data, which is new but thin on how the conversion was actually done.

The main thing to know is that the authors have converted Crunchbase into RDF, produced a 347-million-triple dataset, and put up a Linked Data API at linked-crunchbase.org. No prior work had done exactly this with this source, so the resource itself is new. They also added some sameAs links to other datasets, which is a standard but helpful step for integration work. The reported scale (781k people, 659k organizations, 343k investments) shows they processed a real volume of data, and making a commercial source available in this format is the practical win. For anyone who needs startup or investment records in machine-readable linked data form, this saves the effort of building the conversion from scratch.

The soft spot is the missing detail on the actual work. The abstract mentions developing the API and crawling the RDF, but gives no mapping rules, ontology choices, deduplication steps, or checks for completeness and accuracy. That leaves the central assumption, that Crunchbase’s model translates to RDF without major distortion, unexamined in the provided text. A reader cannot tell how much meaning was lost or how faithfully the triples match the original records.

This is a data-release paper aimed at the semantic web community and anyone doing knowledge-graph work on companies. It will not reshape methods or answer open research questions, but the dataset could be directly useful for experiments.

I would bring it to a reading group if the group is surveying available linked datasets. I would not cite the paper in my own work, but I might download and use the data if the site holds up. It deserves peer review because the resource is concrete and previously unavailable; the methods section just needs more substance to make the claim fully verifiable.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the development of a Linked Data API for Crunchbase (a platform for startup, company, people, and investment data) together with a crawling procedure that produces an RDF knowledge graph containing over 347 million triples (781k people, 659k organizations, 343k investments). sameAs links to external sources are added, and both the API and a snapshot are made publicly available at http://linked-crunchbase.org.

Significance. If the conversion is faithful, the work supplies a large-scale, publicly accessible RDF resource for a domain whose data are otherwise unavailable in machine-readable linked form. The explicit release of both the live API and the 347 M-triple snapshot, together with cross-dataset links, constitutes a concrete contribution to the Linked Open Data cloud.

major comments (2)

[Abstract (data conversion paragraph)] Abstract, paragraph on data conversion: the claim that the generated triples 'faithfully represent the original records' is unsupported because no ontology, property-mapping rules, or validation procedure is described.
[Section on crawling RDF data] Section describing the crawling method: no information is supplied on deduplication logic, crawl completeness, handling of rate limits or pagination, or how the reported entity counts were obtained from the API responses.
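
For readers unfamiliar with what the second comment asks for, the sketch below shows the kind of logic at stake: a breadth-first crawl that follows links between resources, walks through paginated responses, and deduplicates both visited URIs and collected triples. It runs against an in-memory stand-in for an API. The `fetch_page` interface, the `ex:` identifiers, and the page layout are hypothetical, rate limiting is left out, and nothing here is claimed to be the authors' crawler.

```python
from collections import deque


def crawl(seeds, fetch_page):
    """Breadth-first crawl; fetch_page(uri, page) -> (triples, links, has_more)."""
    seen = set(seeds)      # URIs already queued, so none is fetched twice
    queue = deque(seeds)
    triples = set()        # a set drops triples returned by several resources
    while queue:
        uri = queue.popleft()
        page = 1
        while True:
            page_triples, links, has_more = fetch_page(uri, page)
            triples.update(page_triples)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            if not has_more:
                break
            page += 1
    return triples, seen


# In-memory stand-in for a paginated API (hypothetical data).
FAKE_API = {
    "ex:org/acme": [
        ([("ex:org/acme", "ex:name", "Acme")], ["ex:person/ada"]),
        ([("ex:org/acme", "ex:investor", "ex:org/globex")], ["ex:org/globex"]),
    ],
    "ex:person/ada": [
        ([("ex:person/ada", "ex:worksAt", "ex:org/acme")], ["ex:org/acme"]),
    ],
    "ex:org/globex": [
        ([("ex:org/acme", "ex:investor", "ex:org/globex")], []),
    ],
}


def fake_fetch(uri, page):
    pages = FAKE_API[uri]
    page_triples, links = pages[page - 1]
    return page_triples, links, page < len(pages)


triples, visited = crawl(["ex:org/acme"], fake_fetch)
print(len(triples), sorted(visited))
```

Crawl completeness in this setting depends on the seed set and on every resource being reachable through links, which is exactly the information the comment finds missing.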

minor comments (1)

The abstract states that 'insights' into API development are given, yet the manuscript supplies no concrete technical details (endpoint structure, authentication, response formats) that would allow reproduction or independent use of the API.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address the two major comments point by point below.

Referee: [Abstract (data conversion paragraph)] Abstract, paragraph on data conversion: the claim that the generated triples 'faithfully represent the original records' is unsupported because no ontology, property-mapping rules, or validation procedure is described.

Authors: We acknowledge that the abstract asserts faithful representation without describing the ontology, mapping rules, or validation. The revised manuscript will expand the data conversion section to detail the ontology (including vocabularies used), the explicit property-mapping rules from Crunchbase fields to RDF, and the validation steps performed. The abstract will be revised to qualify or remove the unsupported claim.
Revision: yes

Referee: [Section on crawling RDF data] Section describing the crawling method: no information is supplied on deduplication logic, crawl completeness, handling of rate limits or pagination, or how the reported entity counts were obtained from the API responses.

Authors: We agree that these implementation details are absent. The revised manuscript will add descriptions of the deduplication logic, how crawl completeness was evaluated, the handling of rate limits and pagination, and the precise method used to derive the reported entity counts from API responses.
Revision: yes
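
As an illustration of what a documented counting method might look like, the sketch below derives per-class entity counts from a set of triples by counting distinct subjects per rdf:type value, so that a resource seen several times during a crawl is counted once. The class names and the abbreviated `rdf:type` string are hypothetical, and this is one plausible procedure, not the one the authors used.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated for readability in this sketch


def count_instances(triples):
    """Count distinct subjects per class, based on rdf:type statements."""
    subjects_by_class = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            subjects_by_class[o].add(s)
    return {cls: len(subjects) for cls, subjects in subjects_by_class.items()}


# Hypothetical triples, including one repeated type statement.
sample = [
    ("ex:org/acme", RDF_TYPE, "ex:Organization"),
    ("ex:org/acme", RDF_TYPE, "ex:Organization"),
    ("ex:org/globex", RDF_TYPE, "ex:Organization"),
    ("ex:person/ada", RDF_TYPE, "ex:Person"),
    ("ex:person/ada", "ex:worksAt", "ex:org/acme"),
]
counts = count_instances(sample)
print(counts)
```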

Circularity Check

0 steps flagged

No significant circularity

This is a data engineering paper describing the construction of an RDF dataset and Linked Data API from an external commercial source (Crunchbase). The abstract and available text report a mapping process, crawling, and release of 347M triples with counts of entities, but contain no equations, fitted parameters, predictions, uniqueness theorems, or self-citations that could reduce any claim to its own inputs by construction. The central claim (public availability of the API and snapshot) is externally falsifiable and does not rely on internal derivation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard RDF and Linked Data conventions plus access to Crunchbase's proprietary database; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Crunchbase data can be represented faithfully in RDF without loss of key relations. Invoked when describing the conversion and crawling process.
standard math: Standard Linked Data practices (sameAs links, dereferenceable URIs) apply directly to company records. Used to integrate with other data sources.

pith-pipeline@v0.9.0 · 5690 in / 1259 out tokens · 36145 ms · 2026-05-24T18:42:14.221664+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Skala, A.: Characteristics of Startups. In: Digital Startups in Transition Economies. Springer (2019) 41–91

[2] Dalle, J.M., den Besten, M., Menon, C.: Using Crunchbase for economic and managerial research. In: OECD Science, Technology and Industry Working Papers. OECD Publishing (2017)

[3] Meriläinen, K.: Success factors in corporate startup accelerators. Master’s thesis, Aalto University (2016)

[4] Ewens, M., Townsend, R.: Are Early Stage Investors Biased Against Women? Journal of Financial Economics (JFE) (2018)

[5] Färber, M., Menne, C., Harth, A.: A linked data wrapper for crunchbase. Semantic Web 9(4) (2018) 505–515

[6] Mochol, M., Wache, H., Nixon, L.: Improving the Accuracy of Job Search with Semantic Techniques. In: Proceedings of the 10th International Conference on Business Information Systems. BIS’07, Springer (2007) 301–313

[7] Färber, M., Rettinger, A., Harth, A.: Towards Monitoring of Novel Statements in the News. In: Proceedings of the 13th Extended Semantic Web Conference. ESWC 2016, Springer (2016) 285–299

[8] Stadtmüller, S., Speiser, S., Harth, A., Studer, R.: Data-Fu: A Language and an Interpreter for Interaction with Read/Write Linked Data. In: Proceedings of the 22nd International Conference on World Wide Web. WWW’13 (2013) 1225–1236

[9] Janowicz, K., Hitzler, P., Adams, B., Kolas, D., Vardeman, C.: Five stars of Linked Data vocabulary use. Semantic Web 5(3) (2014) 173–176

[10] Harth, A., Knoblock, C.A., Stadtmüller, S., Studer, R., Szekely, P.: On-the-fly Integration of Static and Dynamic Linked Data. In: Proceedings of the 4th International Conference on Consuming Linked Data. COLD’13 (2013) 1–12

[11] Lee, V., Goto, M., Hu, B., Naseer, A., Vandenbussche, P., Shakair, G., Rodrigues, E.M.: Exploiting Linked Data in Financial Engineering. In: Proceedings of the 15th International Conference on Informatics and Semiotics in Organisations. ICISO’14 (2014) 116–125

[12] Xiang, G., Zheng, Z., Wen, M., Hong, J.I., Rosé, C.P., Liu, C.: A Supervised Approach to Predict Company Acquisition with Factual and Topic Features Using Profiles and News Articles on TechCrunch. In: Proceedings of the 6th International AAAI Conference on Weblogs and Social Media. ICWSM’12 (2012) 607–610

[13] Liang, Y.E., Yuan, S.D.: Predicting investor funding behavior using crunchbase social network features. Internet Research 26(1) (2016) 74–100
