CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research
Pith reviewed 2026-05-13 04:12 UTC · model grok-4.3
The pith
CIDR supplies 2,440 proprietary industrial repositories totaling 373 million lines of code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy.
What carries the argument
The multi-stage curation pipeline that combines partner onboarding, automated metadata filtering plus manual code review, and deterministic anonymization across the entire version control history.
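Nothing in this summary pins down how the deterministic anonymization works. The sketch below is a minimal illustration of one common approach, keyed-hash pseudonymization, in which the same input always maps to the same replacement so that authorship and cross-references stay coherent across the full history. The key, regex, and pseudonym format are assumptions for illustration, not the paper's repo-sanitizer implementation.

```python
import hashlib
import hmac
import re

# Illustrative sketch only: repo-sanitizer's actual scheme is not
# documented here. A keyed hash (HMAC) makes replacement deterministic:
# the same input string always yields the same pseudonym, so authorship
# and cross-references stay consistent across the full VCS history.
SECRET_KEY = b"per-release-secret"  # assumption: one key per dataset release

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonym(value: str, prefix: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:10]}"

def sanitize_text(text: str) -> str:
    # Replace every email address with a stable, non-reversible stand-in.
    return EMAIL_RE.sub(
        lambda m: pseudonym(m.group().lower(), "dev") + "@example.invalid",
        text,
    )

print(sanitize_text("Signed-off-by: Jane Doe <jane.doe@acme.example>"))
# The same address anywhere in the history produces the same pseudonym.
```

The design point is consistency rather than secrecy of the mapping function: one developer remains one identity throughout the history, just under a stable pseudonym.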
If this is right
- Models for code intelligence can be trained and evaluated on production-scale industrial code rather than public open-source examples.
- Software quality studies gain access to real version histories and metadata from enterprise settings.
- Developer behavior research can examine patterns inside proprietary codebases that are normally inaccessible.
- Agent benchmarks can incorporate realistic tasks drawn from actual industrial repositories.
Where Pith is reading between the lines
- Direct performance comparisons between models trained on CIDR and on public datasets could quantify how much industrial code differs from open-source code in practice.
- The curation approach may serve as a template for other organizations to release anonymized internal code while retaining research value.
- Multilingual coverage across 138 languages could support development of cross-language tools that reflect industrial rather than academic usage patterns.
Load-bearing premise
The combination of automated filtering, manual review, and anonymization keeps the selected repositories representative of industrial code without substantial loss of research-relevant information or introduction of bias.
What would settle it
An experiment showing that language models trained on CIDR achieve no measurable improvement over models trained on public open-source datasets when tested on proprietary code tasks would undermine the claim of unique research utility.
Original abstract
We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi-stage pipeline encompassing structured partner onboarding, two-stage quality selection combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version control history. The dataset is intended to support research in code intelligence, software quality analysis, pre-training and fine-tuning of code language models, developer behaviour studies, and construction of agent evaluation benchmarks. Access is provided under a restricted commercial license; details are available at https://fermatix.ai/#Contact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Curated Industrial Developer Repository (CIDR), a dataset comprising 2,440 proprietary repositories collected from 12 industrial partners under formal agreements. These span 138 programming languages and total 373 million lines of code. The paper outlines a multi-stage processing pipeline including partner onboarding, two-stage quality selection (automated metadata filtering plus manual code review), and deterministic anonymization of full VCS histories. The dataset is positioned to support code intelligence, software quality analysis, language model pre-training, developer behavior studies, and agent benchmarks, with access under a restricted commercial license.
Significance. If the pipeline claims are substantiated, CIDR would be a valuable addition to software engineering resources, supplying large-scale, real-world industrial code of a kind rarely available to researchers, who otherwise depend on public open-source platforms. The collaboration with 12 organizations and the reported scale (373 MLOC across diverse domains such as fintech and enterprise development) are clear strengths that could improve realism in downstream tasks like model fine-tuning and quality analysis compared to public corpora.
major comments (2)
- [§3.2] (Quality Selection): The description of the two-stage quality selection (automated metadata filtering followed by manual review) provides no quantitative validation such as per-stage rejection rates, inter-rater agreement statistics for manual review, or before/after distributional comparisons (e.g., LOC per language, commit frequency, or cyclomatic complexity). Without these, it is not possible to confirm that the process preserves representativeness and research utility rather than introducing selection bias. (A sketch of the kind of checks requested here follows this list.)
- [§4] (Anonymization Pipeline): The deterministic anonymization is stated to cover the full VCS history, but the manuscript supplies no empirical checks (e.g., preservation of commit patterns, identifier semantics, or task-relevant signals) demonstrating that stripping does not degrade utility for code intelligence or quality analysis tasks.
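Neither kind of statistic appears in the current draft. Purely as a sketch of the checks the first comment asks for, the snippet below computes Cohen's kappa between two hypothetical manual reviewers and a two-sample Kolmogorov-Smirnov test comparing LOC distributions before and after an invented filtering rule. Every input is a placeholder, not CIDR data.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical accept/reject labels from two independent manual reviewers
# over the same candidate repositories (85% simulated agreement).
reviewer_a = rng.integers(0, 2, size=200)
reviewer_b = np.where(rng.random(200) < 0.85, reviewer_a, 1 - reviewer_a)

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa (manual review agreement): {kappa:.2f}")

# Hypothetical LOC-per-repository distributions before and after the
# automated filtering stage; KS tests whether filtering shifts the shape.
loc_before = rng.lognormal(mean=9.0, sigma=1.5, size=5000)
loc_after = loc_before[loc_before > 1_000]  # stand-in accept rule
stat, p = ks_2samp(loc_before, loc_after)
print(f"KS statistic: {stat:.3f}, p-value: {p:.3g}")
```

A low kappa or a large KS statistic would be direct evidence of the selection bias the referee worries about; reporting both per stage would close the gap.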
minor comments (2)
- [Dataset Statistics] The abstract and dataset statistics section report aggregate figures but lack a per-language or per-domain breakdown table, which would help readers assess balance and potential biases.
- [Abstract] Clarify the exact metadata fields provided per repository and their completeness, as the abstract mentions 'structured per-repository metadata' without enumeration.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to improve clarity and substantiation of the described pipeline.
Point-by-point responses
- Referee [§3.2] (Quality Selection): The description of the two-stage quality selection (automated metadata filtering followed by manual review) provides no quantitative validation such as per-stage rejection rates, inter-rater agreement statistics for manual review, or before/after distributional comparisons (e.g., LOC per language, commit frequency, or cyclomatic complexity). Without these, it is not possible to confirm that the process preserves representativeness and research utility rather than introducing selection bias.
  Authors: We agree that additional quantitative details would strengthen the description of the quality selection process and help readers assess potential bias. In the revised manuscript we will report the per-stage rejection rates from automated metadata filtering, the number of repositories subjected to manual review, and any available inter-rater agreement metrics. We will also add before-and-after distributional comparisons (LOC per language, commit frequency) on the full collection to demonstrate that the filtering steps preserve the overall characteristics of the contributed industrial codebases.
  Revision: yes
- Referee [§4] (Anonymization Pipeline): The deterministic anonymization is stated to cover the full VCS history, but the manuscript supplies no empirical checks (e.g., preservation of commit patterns, identifier semantics, or task-relevant signals) demonstrating that stripping does not degrade utility for code intelligence or quality analysis tasks.
  Authors: We acknowledge that empirical validation of utility preservation after anonymization would be valuable. In the revision we will include sample-based checks comparing commit-pattern statistics and identifier-usage distributions before and after anonymization. We will also expand the discussion of the deterministic replacement strategy to explain how consistent identifier mapping across the full VCS history maintains semantic relationships and task-relevant signals for downstream code-intelligence and quality-analysis workloads.
  Revision: yes
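The response above promises sample-based checks without showing one. As a minimal illustration, the snippet below compares the commits-per-author histogram of a raw repository against its sanitized mirror. The paths are hypothetical, and the tested invariant (an identical multiset of commit counts under renaming) is one plausible criterion rather than the authors' stated method.

```python
import subprocess
from collections import Counter

def commits_per_author(repo_path: str) -> Counter:
    """Count commits per author email in a git repository."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(out.splitlines())

# Hypothetical paths: a raw partner repository and its anonymized mirror.
before = commits_per_author("/data/raw/repo-0042")
after = commits_per_author("/data/sanitized/repo-0042")

# Deterministic renaming should permute identities without changing the
# shape of the authorship distribution: same multiset of commit counts.
assert sorted(before.values()) == sorted(after.values()), (
    "anonymization altered the commits-per-author distribution"
)
print(f"{len(after)} authors, distribution preserved")
```

The same pattern extends to other signals the referee lists, such as inter-commit time intervals or identifier co-occurrence counts: compute the statistic on both mirrors and test for equality or near-equality.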
Circularity Check
No circularity: purely descriptive dataset paper with no derivations or fitted claims
Full rationale
The paper presents a new industrial code dataset collected via partner agreements and processed through a described pipeline of onboarding, filtering, review, and anonymization. No equations, predictions, first-principles derivations, or parameter fittings appear anywhere in the manuscript. The central claims are factual descriptions of collection size, languages covered, and access terms; they do not reduce to self-referential inputs or self-citations by construction. The quality-preservation assumption noted by the referee is an evidentiary gap (lack of quantitative validation metrics), not a circularity in any derivation chain. Per the guidelines, a self-contained descriptive paper without load-bearing reductions receives score 0.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Repository Selection and Quality Criteria: Each submitted repository undergoes a two-stage selection process: automated filtering by metadata, followed by manual review of the source code. The two stages are complementary rather than redundant — automated filtering eliminates structurally unsuitable repositories at low cost, while manual review applies qua...
- [2] Anonymization: Anonymization was applied to every accepted repository prior to release using repo-sanitizer, an open-source command-line utility developed specifically for this pipeline and available at https://github.com/Fermatix/repo-sanitizer. The tool operates on the complete repository — including the working tree at the time of submission and the fu...
- [3] Data Management and Accounting System: As the scale of the collection effort grew, the need for a robust data management system became apparent. This section describes the tooling developed to track repositories through their lifecycle and to facilitate structured interaction with contributing partners. 7.1 Evolution of Tooling In the initial phase, reposi...
- [4] Ethical Considerations and Licensing: 8.1 Partner Consent and Data Agreements All participating organizations entered into a Master Source Code License and Sublicensing Agreement with Fermatix AI prior to submitting any repositories. The agreement governs the full lifecycle of the contributed material: the scope and baseline of the repository transfer, the...
- [5] Dataset Statistics and Analysis: This section provides a quantitative characterization of the accepted repositories in CIDR. All figures and extended tables referenced below are available in Appendix A. 9.1 Language Distribution CIDR exhibits a pronounced concentration in two languages: PHP accounts for 40.3% of all lines of code across accepted repositori...
- [6] Quality Assurance: Quality control is applied at each stage of the collection and processing pipeline rather than as a single terminal step, ensuring that errors are identified and addressed as early as possible. At the metadata stage, automated checks verify that each record is complete, internally consistent, and free of anomalous values — for example, a...
- [7] Intended Use Cases and Limitations: 11.1 Intended Use Cases • Pre-training and fine-tuning of code language models • Software defect prediction and code quality research • Developer behaviour and software evolution studies • Evaluation of static analysis and code review tools • Construction of SWE-bench-style evaluation benchmarks for autonomous coding age...
- [8] Evaluating Large Language Models Trained on Code (context: Conclusion): CIDR is a large-scale, industry-sourced software repository dataset constructed through a structured collection, curation, and anonymization pipeline developed specifically for this purpose. By partnering directly with industrial organizations, we have been able to assemble a dataset that complements existing open-source collections and provide...