arxiv: 2606.30175 · v1 · pith:34P566VUnew · submitted 2026-06-29 · 💻 cs.CL

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

Pith reviewed 2026-06-30 06:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords corpus · organization · cross-domain · corpora · cortex · data · knowledge · layer

The pith

Cortex organizes web-scale corpora into a three-layer Ontological Corpus Graph instead of flat document lists.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Cortex as a framework that moves beyond filtering web documents into flat collections. It builds an Ontological Corpus Graph with three layers: a quality-refined content layer, a concept hierarchy that LLMs evolve automatically, and an alignment layer that links concepts across domains at any level of the taxonomy. The structure is meant to serve the increasingly tailored data needs of different LLM training stages. The experiments build a 24.14B-token corpus and evaluate eight frontier models on a new benchmark, CortexBench, to check quality refinement, domain organization, and cross-domain synthesis.

Core claim

Cortex elevates web-scale corpus construction from flat document filtering to structured knowledge organization through an Ontological Corpus Graph (OCG), a three-layer heterogeneous structure unifying a quality-refined content layer, a hierarchical lightweight ontology layer via LLM-driven automated evolution, and a cross-domain alignment layer enabling inter-domain association at arbitrary taxonomic resolution.

What carries the argument

The Ontological Corpus Graph (OCG), a three-layer structure that combines quality-refined documents, an LLM-evolved concept hierarchy, and cross-domain links to enable systematic organization.
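
A minimal sketch of how the three layers could sit in one structure, to make the description concrete. Every name here (`Document`, `Concept`, `OntologicalCorpusGraph`, the `quality` field, the example domains) is hypothetical and chosen for illustration; the paper's actual schema is not given on this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Content layer: one quality-refined document (fields are assumptions)."""
    doc_id: str
    text: str
    quality: float


@dataclass
class Concept:
    """Ontology layer: one node in a per-domain concept hierarchy."""
    concept_id: str
    domain: str
    parent: Optional[str] = None
    doc_ids: list = field(default_factory=list)


@dataclass
class OntologicalCorpusGraph:
    documents: dict = field(default_factory=dict)
    concepts: dict = field(default_factory=dict)
    # Alignment layer: undirected links between concepts of different domains.
    alignments: set = field(default_factory=set)

    def add_concept(self, concept_id, domain, parent=None):
        if parent is not None and parent not in self.concepts:
            raise KeyError(f"unknown parent concept: {parent}")
        self.concepts[concept_id] = Concept(concept_id, domain, parent)

    def add_document(self, doc, concept_id):
        self.documents[doc.doc_id] = doc
        self.concepts[concept_id].doc_ids.append(doc.doc_id)

    def align(self, a, b):
        if self.concepts[a].domain == self.concepts[b].domain:
            raise ValueError("alignment links must cross domains")
        self.alignments.add(frozenset((a, b)))

    def chain(self, concept_id):
        """Return the root-to-node path of concept ids."""
        path = []
        node = concept_id
        while node is not None:
            path.append(node)
            node = self.concepts[node].parent
        return path[::-1]

    def aligned_documents(self, concept_id):
        """Collect ids of documents attached to concepts aligned with this one."""
        result = set()
        for pair in self.alignments:
            if concept_id in pair:
                (other,) = pair - {concept_id}
                result.update(self.concepts[other].doc_ids)
        return result


ocg = OntologicalCorpusGraph()
ocg.add_concept("medicine", "medicine")
ocg.add_concept("medicine/cardiology", "medicine", parent="medicine")
ocg.add_concept("finance", "finance")
ocg.add_concept("finance/health-insurance", "finance", parent="finance")
ocg.add_document(Document("d1", "placeholder text", 0.9), "medicine/cardiology")
ocg.add_document(Document("d2", "placeholder text", 0.8), "finance/health-insurance")
ocg.align("medicine/cardiology", "finance/health-insurance")
```

Because alignment links may join concepts at any depth of either hierarchy, this toy structure mirrors the "arbitrary taxonomic resolution" wording, though only in the loosest sense.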

If this is right

  • Quality-refined content improves data for LLM training stages.
  • Hierarchical ontology enables systematic knowledge organization at scale.
  • Cross-domain alignment supports synthesis of data across domains at flexible resolution.
  • The resulting CortexBench benchmark tests search-and-reasoning across eight frontier LLMs.
  • The full 24.14B-token corpus and OCG will be released publicly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support more precise data selection for specific model capabilities like reasoning or domain expertise.
  • It might reduce redundancy in training data by using the ontology to deduplicate at the concept level.
  • Similar graph structures could be tested on non-web sources such as books or scientific papers to check broader applicability.
  • If the alignment layer works at arbitrary resolution, it could enable fine-grained mixing of domains for targeted fine-tuning.

Load-bearing premise

The LLM-driven automated evolution produces a reliable hierarchical lightweight ontology that accurately captures and organizes the corpus content without substantial errors, biases, or need for human correction.

What would settle it

Manual review of the generated ontology hierarchy shows frequent mismatches with actual corpus content, or models trained on the Cortex corpus show no gains over flat corpora on cross-domain reasoning tasks.

Figures

Figures reproduced from arXiv: 2606.30175 by Chengtao Gan, Huajun Chen, Songze Li, Wen Zhang, Xiaoke Guo, Yushan Zhu, Zhaoyan Gong, Zhiqiang Liu.

Figure 1
Figure 1: Overview of the CORTEX framework. … and target-class samples; the full weighting scheme is detailed in Appendix F. Weighted Huber Regression Loss. To robustly fit the teacher's continuous scores, we employ a weighted Huber loss (Huber, 1992): $\mathcal{L}_{\mathrm{reg}} = \frac{1}{N} \sum_{n=1}^{N} w_n \, \mathrm{Huber}_{\delta}(\hat{y}_n - y_n)$, where $\mathrm{Huber}_{\delta}(e) = \frac{1}{2} e^2$ if $|e| \le \delta$ and $\delta \, (|e| - \frac{1}{2}\delta)$ otherwise. Soft-Threshold Ordinal Boundary Loss. Let $t_1 < t_2$ denote the…
Figure 2
Figure 2: Sliding-window update rate $r_t$ (Eq. 9) during concept chain evolution. N.4 Concept Expansion Protocol. Each concept chain is expanded by an LLM into a natural-language Chinese paragraph of approximately 300–500 characters, covering definitions or core topics, the scope of subtopics, common discussion dimensions or scenarios, and related keywords. The system prompt explicitly forbids named entities (specif…
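
The weighted Huber regression loss quoted in the Figure 1 excerpt can be written out directly. The sketch below follows that formula term by term; the sample weights `w_n` and the value of `delta` are placeholders, since the paper's weighting scheme (its Appendix F) is not reproduced on this page.

```python
def huber(e, delta):
    """Huber_delta(e): quadratic for |e| <= delta, linear beyond it."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)


def weighted_huber_loss(preds, targets, weights, delta=1.0):
    """L_reg = (1/N) * sum_n w_n * Huber_delta(y_hat_n - y_n)."""
    if not preds or not (len(preds) == len(targets) == len(weights)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        w * huber(p - t, delta)
        for p, t, w in zip(preds, targets, weights)
    )
    return total / len(preds)


# Illustrative values only: one small residual, one large residual.
loss = weighted_huber_loss([0.5, 3.0], [0.0, 0.0], [1.0, 2.0], delta=1.0)
```

The large residual contributes linearly rather than quadratically, which is the robustness property the excerpt appeals to.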
Original abstract

The continuous evolution of large language models drives escalating demands on data scale and quality, and as different training stages impose increasingly tailored data requirements, systematic organization of high-quality corpora becomes indispensable. Existing corpus construction pipelines confine the resulting corpora to flat, undifferentiated document collections, universally lacking systematic knowledge organization. We present Cortex, to our knowledge the first framework that elevates web-scale corpus construction from flat document filtering to structured knowledge organization through an Ontological Corpus Graph (OCG), a three-layer heterogeneous structure unifying a quality-refined content layer, a hierarchical lightweight ontology layer via LLM-driven automated evolution, and a cross-domain alignment layer enabling inter-domain association at arbitrary taxonomic resolution. Comprehensive experiments confirm the effectiveness of Cortex. In particular, we leverage the OCG to synthesize CortexBench, a cross-domain search-and-reasoning benchmark whose evaluation across eight frontier LLMs validates the effectiveness of quality refinement, domain organization, and cross-domain data synthesis. We will publicly release the complete codebase, a 24.14B-token refined corpus with its OCG, and CortexBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Cortex, claimed as the first framework to move web-scale corpus construction from flat document filtering to structured knowledge organization via an Ontological Corpus Graph (OCG). The OCG is a three-layer heterogeneous structure: a quality-refined content layer, a hierarchical lightweight ontology layer built through LLM-driven automated evolution, and a cross-domain alignment layer for inter-domain associations at arbitrary taxonomic resolution. Experiments synthesize CortexBench (a cross-domain search-and-reasoning benchmark) and evaluate it across eight frontier LLMs to validate quality refinement, domain organization, and cross-domain synthesis; the authors plan to release the full codebase, a 24.14B-token refined corpus with its OCG, and CortexBench.

Significance. If the OCG construction and LLM-driven ontology evolution prove reliable, the work would advance corpus construction by enabling systematic knowledge organization at scale, potentially improving data quality for LLM training stages with tailored requirements and supporting cross-domain reasoning. The planned public release of code, corpus, OCG, and benchmark is a clear strength for reproducibility and community use.

major comments (1)
  1. [Abstract and Experiments section] Validation of the hierarchical lightweight ontology layer (the load-bearing middle layer of the OCG) occurs only indirectly, via downstream performance of eight LLMs on CortexBench. No direct accuracy metrics (e.g., precision/recall vs. human annotations, inter-annotator agreement, consistency checks, or error rates on the evolved ontology) are described, leaving the central assumption that LLM-driven automated evolution produces a reliable ontology untested at the source.
minor comments (1)
  1. [Abstract] The claim of being 'to our knowledge the first' would benefit from a brief related-work contrast even at abstract length; the statement that experiments 'confirm the effectiveness' lacks any mention of baselines or controls.
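
The direct checks named in the major comment could be run on a small annotated sample rather than the full corpus. The sketch below is one hedged way to do that, not anything the paper reports: it scores predicted parent-child edges against a human-labelled gold set and computes Cohen's kappa between two annotators. The function names and the toy edges and labels are made up for illustration.

```python
from collections import Counter


def edge_precision_recall(predicted, gold):
    """Precision and recall of predicted (parent, child) edges vs. a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy sample: edges proposed by the ontology vs. edges accepted by annotators.
predicted_edges = [("medicine", "cardiology"), ("medicine", "banking"),
                   ("finance", "insurance")]
gold_edges = [("medicine", "cardiology"), ("finance", "insurance"),
              ("finance", "banking")]
precision, recall = edge_precision_recall(predicted_edges, gold_edges)

# Two annotators judging four edges as valid (1) or invalid (0).
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Sampling a few hundred edges per domain would keep annotation cost bounded while still giving an error-rate estimate at the source.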

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] Validation of the hierarchical lightweight ontology layer (the load-bearing middle layer of the OCG) occurs only indirectly, via downstream performance of eight LLMs on CortexBench. No direct accuracy metrics (e.g., precision/recall vs. human annotations, inter-annotator agreement, consistency checks, or error rates on the evolved ontology) are described, leaving the central assumption that LLM-driven automated evolution produces a reliable ontology untested at the source.

    Authors: We agree that validation of the hierarchical ontology layer is indirect. Direct metrics such as precision/recall against human annotations or inter-annotator agreement are not reported because creating reliable human ground truth for an LLM-evolved ontology over a 24.14B-token web-scale corpus is computationally and financially prohibitive. CortexBench was explicitly constructed to test the ontology's utility through cross-domain search and reasoning tasks; consistent performance gains across eight frontier LLMs provide evidence that the evolved structure supports the intended knowledge organization. We will add an explicit limitations paragraph discussing the indirect nature of this validation in the revised manuscript.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: framework construction is self-contained engineering description

Full rationale

The paper presents Cortex as a methodological framework for building an Ontological Corpus Graph (OCG) via LLM-driven ontology evolution and cross-domain alignment, with validation performed through downstream experiments on CortexBench across eight LLMs. No equations, fitted parameters, or predictions are described that reduce any claim to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim of elevating corpus construction to structured organization is advanced as an original engineering contribution rather than derived from prior self-referential steps, making the derivation chain independent of the patterns that trigger circularity flags.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can reliably automate ontology evolution and on the new construct of the OCG itself; no free parameters or external benchmarks are mentioned in the abstract.

axioms (1)
  • Domain assumption: LLMs can perform reliable automated evolution of a hierarchical lightweight ontology from web-scale corpus content.
    Invoked for the ontology layer construction.
invented entities (1)
  • Ontological Corpus Graph (OCG) · no independent evidence
    purpose: Unify quality-refined content, LLM-evolved ontology, and cross-domain alignment into a single heterogeneous structure for corpus organization.
    Central new entity introduced to replace flat document collections.

pith-pipeline@v0.9.1-grok · 5741 in / 1363 out tokens · 52675 ms · 2026-06-30T06:02:29.049058+00:00 · methodology

Reference graph

Works this paper leans on

80 extracted references · 41 canonical work pages · 16 internal anchors

  1. [7]

    TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora

    Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, and Jiawei Han. 2025. TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora.

  2. [10]

    The RefinedWeb Dataset for Falcon

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb Dataset for Falcon. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIP...

  3. [11]

    QuRating: Selecting High-Quality Data for Training Language Models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. QuRating: Selecting High-Quality Data for Training Language Models.

  4. [18]

    FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

    Erik Henriksson, Otto Tarkka, and Filip Ginter. 2025. FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering.

  5. [19]

    Common Crawl Corpus

  6. [21]

    Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton and Marianna Nezhurina and Amro Abbas and Cheng

    Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Yitzhak Gadre and Hritik Bansal and Etash Kumar Guha and Sedrick Scott Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee F. Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan ...

  7. [25]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Guilherme Penedo and Hynek Kydl. 2024. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.

  8. [26]

    Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2025. Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset.

  9. [29]

    Breakthroughs in statistics: Methodology and distribution

    Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution. 1992.

  10. [32]

    Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

    Rada Mihalcea and Paul Tarau. 2004. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

  11. [33]

    A statistical interpretation of term specificity and its application in retrieval

    Karen Sp. 2004. A statistical interpretation of term specificity and its application in retrieval. doi:10.1108/00220410410560573

  12. [36]

    Qwen3 Technical Report

    Qwen Team. 2025. Qwen3 Technical Report. CoRR. doi:10.48550/ARXIV.2505.09388

  13. [39]

    Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen. 2022. LoRA: Low-Rank Adaptation of Large Language Models.

  14. [40]

    Proceedings of the 29th Symposium on Operating Systems Principles

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles. doi:10.1145/3600006.3613165

  15. [42]

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval:. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New ...

  16. [43]

    Findings of the Association for Computational Linguistics

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2024. In Findings of the Association for Computational Linguistics. doi:10.18653/V1/2024.FINDINGS-ACL.671

  17. [44]

    9th International Conference on Learning Representations

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. In 9th International Conference on Learning Representations.

  18. [45]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457

  19. [46]

    OpenCompass: A Universal Evaluation Platform for Foundation Models

  20. [47]

    Robertson and Steve Walker

    Stephen E. Robertson and Steve Walker. 1994. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. doi:10.1007/978-1-4471-2099-5_24

  21. [49]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek. 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. doi:10.48550/ARXIV.2512.02556

  22. [50]

    DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

  23. [51]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. doi:10.48550/ARXIV.2501.12948

  24. [52]

    ReLE: Really Reliable Live Evaluation for Chinese LLMs

  25. [54]

    Narasimhan and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. In The Eleventh International Conference on Learning Representations.

  26. [55]

    Liu

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. J. Mach. Learn. Res.

  27. [58]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman. 1972.

  28. [59]

    Publications Manual. 1983.

  29. [60]

    Chandra and Dexter C

    Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243

  30. [61]

    Scalable training of

    Galen Andrew and Jianfeng Gao. Scalable training of

  31. [63]

    Chinese Stopwords Corpus

    Chinese Stopwords Corpus. 2018.

  32. [64]

    Dan Gusfield. 1997.

  33. [65]

    Tetreault

    Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Computing Research Repository.

  34. [66]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

    Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.

  35. [67]

    Adrien Barbaresi. 2021. https://doi.org/10.18653/V1/2021.ACL-DEMO.15 Trafilatura: A web scraping library and command-line tool for text discovery and extraction . In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL...

  36. [68]

    Janek Bevendorff, Martin Potthast, and Benno Stein. 2021. https://arxiv.org/abs/2112.03103 Fastwarc: Optimizing large-scale web archive analytics . CoRR, abs/2112.03103

  37. [69]

    Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. 2018. https://doi.org/10.1007/978-3-319-76941-7_83 Elastic chatnoir: Search engine for the clueweb and the common crawl . In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings , Lecture Notes in Compute...

  38. [70]

    Wenzhi Cao, Vahid Mirjalili, and Sebastian Raschka. 2020. https://doi.org/10.1016/J.PATREC.2020.11.008 Rank consistent ordinal regression for neural networks with application to age estimation . Pattern Recognit. Lett., 140:325--331

  39. [71]

    Jianghao Chen, Pu Jian, Tengxiao Xi, Yidong Yi, Qianlong Du, Chenglin Ding, Guibo Zhu, Chengqing Zong, Jinqiao Wang, and Jiajun Zhang. 2023. https://doi.org/10.48550/ARXIV.2311.01149 Chinesewebtext: Large-scale high-quality chinese web text extracted with effective evaluation model . CoRR, abs/2311.01149

  40. [72]

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. https://doi.org/10.48550/ARXIV.2402.03216 BGE m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation . CoRR, abs/2402.03216

  41. [73]

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, and Jimmy Lin. 2025. https://doi.org/10.48550/ARXIV.2508.06600 Browsecomp-plus: A more fair and transparent eva...

  42. [74]

    Common crawl corpus

    Common Crawl Foundation . Common crawl corpus. https://commoncrawl.org/. Accessed: 2025-11-20

  43. [75]

    Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. https://doi.org/10.18653/V1/2020.FINDINGS-EMNLP.58 Revisiting pre-trained models for chinese natural language processing . In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 , Findings of ACL , pages 657--668. Associati...

  44. [76]

    DeepSeek-AI. 2024. https://doi.org/10.48550/ARXIV.2412.19437 Deepseek-v3 technical report . CoRR, abs/2412.19437

  45. [77]

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. https://doi.org/10.48550/ARXIV.2303.15056 Chatgpt outperforms crowd-workers for text-annotation tasks . CoRR, abs/2303.15056

  46. [78]

    goto456. 2018. Chinese stopwords corpus. https://github.com/goto456/stopwords

  47. [79]

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. https://doi.org/10.48550/ARXIV.2306....

  48. [80]

    Erik Henriksson, Otto Tarkka, and Filip Ginter. 2025. https://aclanthology.org/2025.nodalida-1.27/ Finerweb-10bt: Refining web data with llm-based line-level filtering . In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies, NoDaLiDa/Baltic-HLT 2025, Tallinn, Estonia, Marc...

  49. [81]

    Distilling the Knowledge in a Neural Network

    Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. https://arxiv.org/abs/1503.02531 Distilling the knowledge in a neural network . CoRR, abs/1503.02531

  50. [82]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. https://doi.org/10.48550/...

  51. [83]

    Peter J Huber. 1992. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492--518. Springer

  52. [84]

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/V1/2020.ACL-MAIN.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 , pages 6282--6293. A...

  53. [85]

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomás Mikolov. 2017. https://doi.org/10.18653/V1/E17-2068 Bag of tricks for efficient text classification . In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers , pages 427-...

  54. [86]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. https://arxiv.org/abs/2001.08361 Scaling laws for neural language models . CoRR, abs/2001.08361

  55. [87]

    Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, and Jiawei Han. 2025. https://aclanthology.org/2025.acl-long.1442/ Taxoadapt: Aligning llm-based multidimensional taxonomy construction to evolving research corpora . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL...

  56. [88]

    Yang Lei, Jiangtong Li, Ming Jiang, Junjie Hu, Dawei Cheng, Zhijun Ding, and Changjun Jiang. 2023. https://doi.org/10.48550/ARXIV.2311.05812 Cfbenchmark: Chinese financial assistant benchmark for large language model . CoRR, abs/2311.05812

  57. [89]

    Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, and 40 others

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Kumar Guha, Sedrick Scott Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee F. Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, and 40 others. 2024. http://papers.nips.cc/paper_files/paper/202...

  58. [90]

    Foerster, Roberta Raileanu, and Maria Lomeli

    Alisia Maria Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob N. Foerster, Roberta Raileanu, and Maria Lomeli. 2024. https://doi.org/10.48550/ARXIV.2409.08239 Source2synth: Synthetic data generation and curation grounded in real data sources . CoRR, abs/2409.08239

  59. [91]

    Rada Mihalcea and Paul Tarau. 2004. https://aclanthology.org/W04-3252/ Textrank: Bringing order into text . In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing , EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain , pages 404--411. ACL

  60. [92]

    OpenAI. 2026. https://doi.org/10.48550/ARXIV.2601.03267 Openai GPT-5 system card . CoRR, abs/2601.03267

  61. [93]

    Raffel, Leandro von Werra, and Thomas Wolf

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin A. Raffel, Leandro von Werra, and Thomas Wolf. 2024. http://papers.nips.cc/paper_files/paper/2024/hash/370df50ccfdf8bde18f8f9c2d9151bda-Abstract-Datasets_and_Benchmarks_Track.html The fineweb datasets: Decanting the web for the finest text data at scale . In...

  62. [94]

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. http://papers.nips.cc/paper_files/paper/2023/hash/fa3ed726cc5073b9c31e3e49a807789c-Abstract-Datasets_and_Benchmarks.html The refinedweb dataset for falcon LLM: outperforming curated ...

  63. [95]

    Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. https://doi.org/10.18653/V1/2020.ACL-DEMOS.14 Stanza: A python natural language processing toolkit for many human languages . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2020, Online, July 5-10, 2020...

  64. [96]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. https://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . J. Mach. Learn. Res., 21:140:1--140:67

  65. [97]

    ReLE Benchmark Team . 2025. https://github.com/jeinlee1991/chinese-llm-benchmark Rele: Really reliable live evaluation for chinese llms

  66. [98]

    Xintong Shi, Wenzhi Cao, and Sebastian Raschka. 2023. https://doi.org/10.1007/S10044-023-01181-9 Deep neural networks for rank-consistent ordinal regression based on conditional probabilities . Pattern Anal. Appl., 26(3):941--955

  67. [99]

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2025. https://aclanthology.org/2025.acl-long.123/ Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset . In Proceedings of the 63rd Annual Meeting of the Association for Computational Lingu...

  68. [100]

    Llama Team. 2024. https://doi.org/10.48550/ARXIV.2407.21783 The llama 3 herd of models . CoRR, abs/2407.21783

  69. [101]

    Martin Thoma. 2018. https://doi.org/10.5281/zenodo.841984 WiLI-2018 - Wikipedia Language Identification database

  70. [102]

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. https://doi.org/10.1162/TACL_A_00475 Musique: Multihop questions via single-hop question composition . Trans. Assoc. Comput. Linguistics, 10:539--554

  71. [103]

    Liangdong Wang, Bowen Zhang, Chengwei Wu, Hanyu Zhao, Xiaofeng Shi, Shuhao Gu, Jijie Li, Quanyue Ma, Tengfei Pan, and Guang Liu. 2024. https://doi.org/10.48550/ARXIV.2410.18505 CCI3.0-HQ: a large-scale chinese dataset of high quality designed for pre-training large language models . CoRR, abs/2410.18505

  72. [104]

    Smith, Daniel Khashabi, and Hannaneh Hajishirzi

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/V1/2023.ACL-LONG.754 Self-instruct: Aligning language models with self-generated instructions . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A...

  73. [105]

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. 2025. https://doi.org/10.48550/ARXIV.2504.12516 Browsecomp: A simple yet challenging benchmark for browsing agents . CoRR, abs/2504.12516

  74. [106]

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. https://proceedings.mlr.press/v235/wettig24a.html Qurating: Selecting high-quality data for training language models . In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 , Proceedings of Machine Learning Research, pages 52915--52971. ...

  75. [107]

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. https://doi.org/10.48550/ARXIV.2309.07597 C-pack: Packaged resources to advance general chinese embedding . CoRR, abs/2309.07597

  76. [108]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024. https://doi.org/10.48550/ARXIV.2412.15115 Qwen2.5 technical report . CoRR, abs/2412.15115

  77. [109]

    Cohen, Ruslan Salakhutdinov, and Christopher D

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/V1/D18-1259 Hotpotqa: A dataset for diverse, explainable multi-hop question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October...

  78. [110]

    Narasimhan, and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. https://openreview.net/forum?id=WE_vluYUL-X React: Synergizing reasoning and acting in language models . In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net

  79. [111]

    Qingkai Zeng, Yuyang Bai, Zhaoxuan Tan, Shangbin Feng, Zhenwen Liang, Zhihan Zhang, and Meng Jiang. 2024. https://doi.org/10.1145/3627673.3679608 Chain-of-layer: Iteratively prompting large language models for taxonomy induction from limited examples . In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM 20...

  80. [112]

    Sadler, Michelle Vanni, and Jiawei Han

    Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian M. Sadler, Michelle Vanni, and Jiawei Han. 2018. https://doi.org/10.1145/3219819.3220064 Taxogen: Unsupervised topic taxonomy construction by adaptive term embedding and clustering . In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2...