CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research
Pith reviewed 2026-05-13 04:12 UTC · model grok-4.3
The pith
CIDR supplies 2,440 proprietary industrial repositories totaling 373 million lines of code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy.
What carries the argument
The multi-stage curation pipeline that combines partner onboarding, automated metadata filtering plus manual code review, and deterministic anonymization across the entire version control history.
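Nothing in this summary pins down how the deterministic anonymization works. The sketch below is a minimal illustration of one common approach, keyed-hash pseudonymization, in which the same input always maps to the same replacement so that authorship and cross-references stay coherent across the full history. The key, regex, and pseudonym format are assumptions for illustration, not the paper's repo-sanitizer implementation.

```python
import hashlib
import hmac
import re

# Illustrative sketch only: repo-sanitizer's actual scheme is not
# documented here. A keyed hash (HMAC) makes replacement deterministic:
# the same input string always yields the same pseudonym, so authorship
# and cross-references stay consistent across the full VCS history.
SECRET_KEY = b"per-release-secret"  # assumption: one key per dataset release

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonym(value: str, prefix: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:10]}"

def sanitize_text(text: str) -> str:
    # Replace every email address with a stable, non-reversible stand-in.
    return EMAIL_RE.sub(
        lambda m: pseudonym(m.group().lower(), "dev") + "@example.invalid",
        text,
    )

print(sanitize_text("Signed-off-by: Jane Doe <jane.doe@acme.example>"))
# The same address anywhere in the history produces the same pseudonym.
```

The design point is consistency rather than secrecy of the mapping function: one developer remains one identity throughout the history, just under a stable pseudonym.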
If this is right
- Models for code intelligence can be trained and evaluated on production-scale industrial code rather than public open-source examples.
- Software quality studies gain access to real version histories and metadata from enterprise settings.
- Developer behavior research can examine patterns inside proprietary codebases that are normally inaccessible.
- Agent benchmarks can incorporate realistic tasks drawn from actual industrial repositories.
Where Pith is reading between the lines
- Direct performance comparisons between models trained on CIDR and on public datasets could quantify how much industrial code differs from open-source code in practice.
- The curation approach may serve as a template for other organizations to release anonymized internal code while retaining research value.
- Multilingual coverage across 138 languages could support development of cross-language tools that reflect industrial rather than academic usage patterns.
Load-bearing premise
The combination of automated filtering, manual review, and anonymization keeps the selected repositories representative of industrial code without substantial loss of research-relevant information or introduction of bias.
What would settle it
An experiment showing that language models trained on CIDR achieve no measurable improvement over models trained on public open-source datasets when tested on proprietary code tasks would undermine the claim of unique research utility.
Original abstract
We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi-stage pipeline encompassing structured partner onboarding, two-stage quality selection combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version control history. The dataset is intended to support research in code intelligence, software quality analysis, pre-training and fine-tuning of code language models, developer behaviour studies, and construction of agent evaluation benchmarks. Access is provided under a restricted commercial license; details are available at https://fermatix.ai/#Contact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Curated Industrial Developer Repository (CIDR), a dataset comprising 2,440 proprietary repositories collected from 12 industrial partners under formal agreements. These span 138 programming languages and total 373 million lines of code. The paper outlines a multi-stage processing pipeline including partner onboarding, two-stage quality selection (automated metadata filtering plus manual code review), and deterministic anonymization of full VCS histories. The dataset is positioned to support code intelligence, software quality analysis, language model pre-training, developer behavior studies, and agent benchmarks, with access under a restricted commercial license.
Significance. If the pipeline claims are substantiated, CIDR would be a valuable addition to software engineering resources, supplying large-scale, real-world industrial code of a kind rarely available to researchers, who otherwise depend on public open-source platforms. The collaboration with 12 organizations and the reported scale (373 MLOC across diverse domains such as fintech and enterprise development) are clear strengths that could improve realism in downstream tasks like model fine-tuning and quality analysis compared to public corpora.
major comments (2)
- [§3.2] (Quality Selection): The description of the two-stage quality selection (automated metadata filtering followed by manual review) provides no quantitative validation such as per-stage rejection rates, inter-rater agreement statistics for manual review, or before/after distributional comparisons (e.g., LOC per language, commit frequency, or cyclomatic complexity). Without these, it is not possible to confirm that the process preserves representativeness and research utility rather than introducing selection bias. (A sketch of the kind of checks requested here follows this list.)
- [§4] (Anonymization Pipeline): The deterministic anonymization is stated to cover the full VCS history, but the manuscript supplies no empirical checks (e.g., preservation of commit patterns, identifier semantics, or task-relevant signals) demonstrating that stripping does not degrade utility for code intelligence or quality analysis tasks.
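Neither kind of statistic appears in the current draft. Purely as a sketch of the checks the first comment asks for, the snippet below computes Cohen's kappa between two hypothetical manual reviewers and a two-sample Kolmogorov-Smirnov test comparing LOC distributions before and after an invented filtering rule. Every input is a placeholder, not CIDR data.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical accept/reject labels from two independent manual reviewers
# over the same candidate repositories (85% simulated agreement).
reviewer_a = rng.integers(0, 2, size=200)
reviewer_b = np.where(rng.random(200) < 0.85, reviewer_a, 1 - reviewer_a)

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa (manual review agreement): {kappa:.2f}")

# Hypothetical LOC-per-repository distributions before and after the
# automated filtering stage; KS tests whether filtering shifts the shape.
loc_before = rng.lognormal(mean=9.0, sigma=1.5, size=5000)
loc_after = loc_before[loc_before > 1_000]  # stand-in accept rule
stat, p = ks_2samp(loc_before, loc_after)
print(f"KS statistic: {stat:.3f}, p-value: {p:.3g}")
```

A low kappa or a large KS statistic would be direct evidence of the selection bias the referee worries about; reporting both per stage would close the gap.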
minor comments (2)
- [Dataset Statistics] The abstract and dataset statistics section report aggregate figures but lack a per-language or per-domain breakdown table, which would help readers assess balance and potential biases.
- [Abstract] Clarify the exact metadata fields provided per repository and their completeness, as the abstract mentions 'structured per-repository metadata' without enumeration.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to improve clarity and substantiation of the described pipeline.
Point-by-point responses
- Referee [§3.2] (Quality Selection): The description of the two-stage quality selection (automated metadata filtering followed by manual review) provides no quantitative validation such as per-stage rejection rates, inter-rater agreement statistics for manual review, or before/after distributional comparisons (e.g., LOC per language, commit frequency, or cyclomatic complexity). Without these, it is not possible to confirm that the process preserves representativeness and research utility rather than introducing selection bias.
  Authors: We agree that additional quantitative details would strengthen the description of the quality selection process and help readers assess potential bias. In the revised manuscript we will report the per-stage rejection rates from automated metadata filtering, the number of repositories subjected to manual review, and any available inter-rater agreement metrics. We will also add before-and-after distributional comparisons (LOC per language, commit frequency) on the full collection to demonstrate that the filtering steps preserve the overall characteristics of the contributed industrial codebases.
  Revision: yes
- Referee [§4] (Anonymization Pipeline): The deterministic anonymization is stated to cover the full VCS history, but the manuscript supplies no empirical checks (e.g., preservation of commit patterns, identifier semantics, or task-relevant signals) demonstrating that stripping does not degrade utility for code intelligence or quality analysis tasks.
  Authors: We acknowledge that empirical validation of utility preservation after anonymization would be valuable. In the revision we will include sample-based checks comparing commit-pattern statistics and identifier-usage distributions before and after anonymization. We will also expand the discussion of the deterministic replacement strategy to explain how consistent identifier mapping across the full VCS history maintains semantic relationships and task-relevant signals for downstream code-intelligence and quality-analysis workloads.
  Revision: yes
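The response above promises sample-based checks without showing one. As a minimal illustration, the snippet below compares the commits-per-author histogram of a raw repository against its sanitized mirror. The paths are hypothetical, and the tested invariant (an identical multiset of commit counts under renaming) is one plausible criterion rather than the authors' stated method.

```python
import subprocess
from collections import Counter

def commits_per_author(repo_path: str) -> Counter:
    """Count commits per author email in a git repository."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(out.splitlines())

# Hypothetical paths: a raw partner repository and its anonymized mirror.
before = commits_per_author("/data/raw/repo-0042")
after = commits_per_author("/data/sanitized/repo-0042")

# Deterministic renaming should permute identities without changing the
# shape of the authorship distribution: same multiset of commit counts.
assert sorted(before.values()) == sorted(after.values()), (
    "anonymization altered the commits-per-author distribution"
)
print(f"{len(after)} authors, distribution preserved")
```

The same pattern extends to other signals the referee lists, such as inter-commit time intervals or identifier co-occurrence counts: compute the statistic on both mirrors and test for equality or near-equality.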
Circularity Check
No circularity: purely descriptive dataset paper with no derivations or fitted claims
Full rationale
The paper presents a new industrial code dataset collected via partner agreements and processed through a described pipeline of onboarding, filtering, review, and anonymization. No equations, predictions, first-principles derivations, or parameter fittings appear anywhere in the manuscript. The central claims are factual descriptions of collection size, languages covered, and access terms; they do not reduce to self-referential inputs or self-citations by construction. The quality-preservation assumption noted by the referee is an evidentiary gap (lack of quantitative validation metrics), not a circularity in any derivation chain. Per the guidelines, a self-contained descriptive paper without load-bearing reductions receives score 0.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Repository Selection and Quality Criteria: Each submitted repository undergoes a two-stage selection process: automated filtering by metadata, followed by manual review of the source code. The two stages are complementary rather than redundant — automated filtering eliminates structurally unsuitable repositories at low cost, while manual review applies qua...
- [2] Anonymization: Anonymization was applied to every accepted repository prior to release using repo-sanitizer, an open-source command-line utility developed specifically for this pipeline and available at https://github.com/Fermatix/repo-sanitizer. The tool operates on the complete repository — including the working tree at the time of submission and the fu...
- [3] Data Management and Accounting System: As the scale of the collection effort grew, the need for a robust data management system became apparent. This section describes the tooling developed to track repositories through their lifecycle and to facilitate structured interaction with contributing partners. 7.1 Evolution of Tooling In the initial phase, reposi...
- [4] Ethical Considerations and Licensing: 8.1 Partner Consent and Data Agreements All participating organizations entered into a Master Source Code License and Sublicensing Agreement with Fermatix AI prior to submitting any repositories. The agreement governs the full lifecycle of the contributed material: the scope and baseline of the repository transfer, the...
- [5] Dataset Statistics and Analysis: This section provides a quantitative characterization of the accepted repositories in CIDR. All figures and extended tables referenced below are available in Appendix A. 9.1 Language Distribution CIDR exhibits a pronounced concentration in two languages: PHP accounts for 40.3% of all lines of code across accepted repositori...
- [6] Quality Assurance: Quality control is applied at each stage of the collection and processing pipeline rather than as a single terminal step, ensuring that errors are identified and addressed as early as possible. At the metadata stage, automated checks verify that each record is complete, internally consistent, and free of anomalous values — for example, a...
- [7] Intended Use Cases and Limitations: 11.1 Intended Use Cases • Pre-training and fine-tuning of code language models • Software defect prediction and code quality research • Developer behaviour and software evolution studies • Evaluation of static analysis and code review tools • Construction of SWE-bench-style evaluation benchmarks for autonomous coding age...
- [8] Evaluating Large Language Models Trained on Code (context: Conclusion): CIDR is a large-scale, industry-sourced software repository dataset constructed through a structured collection, curation, and anonymization pipeline developed specifically for this purpose. By partnering directly with industrial organizations, we have been able to assemble a dataset that complements existing open-source collections and provide...