CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph
Pith reviewed 2026-06-30 06:02 UTC · model grok-4.3
The pith
Cortex organizes web-scale corpora into a three-layer Ontological Corpus Graph instead of flat document lists.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cortex elevates web-scale corpus construction from flat document filtering to structured knowledge organization through an Ontological Corpus Graph (OCG), a three-layer heterogeneous structure unifying a quality-refined content layer, a hierarchical lightweight ontology layer via LLM-driven automated evolution, and a cross-domain alignment layer enabling inter-domain association at arbitrary taxonomic resolution.
What carries the argument
The Ontological Corpus Graph (OCG), a three-layer structure that combines quality-refined documents, an LLM-evolved concept hierarchy, and cross-domain links to enable systematic organization.
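To make the three layers concrete, here is a minimal Python sketch of a toy OCG. It is not the authors' implementation: the class `OCG`, its field names, the helper `toy_graph`, and the example concepts are all assumptions introduced for illustration, and the paper's actual schema may differ. The sketch shows one possible reading of "arbitrary taxonomic resolution": cross-domain links stored between fine-grained concepts are projected onto a coarser level of the hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OCG:
    """Toy three-layer graph; every field name here is an illustrative assumption."""
    # Content layer: document id -> (text, quality score).
    documents: dict[str, tuple[str, float]] = field(default_factory=dict)
    # Ontology layer: concept -> parent concept (None marks a domain root).
    parent: dict[str, Optional[str]] = field(default_factory=dict)
    # Assignment of each document to a concept in the ontology layer.
    doc_concept: dict[str, str] = field(default_factory=dict)
    # Alignment layer: undirected links between concepts of different domains.
    links: set[frozenset[str]] = field(default_factory=set)

    def path_from_root(self, concept: str) -> list[str]:
        """Return the chain of concepts from the domain root down to `concept`."""
        chain = []
        node: Optional[str] = concept
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return chain[::-1]

    def at_depth(self, concept: str, depth: int) -> str:
        """Coarsen `concept` to its ancestor at `depth` (0 = root).

        A concept that is already shallower than `depth` is returned unchanged.
        """
        path = self.path_from_root(concept)
        return path[min(depth, len(path) - 1)]

    def links_at_depth(self, depth: int) -> set[frozenset[str]]:
        """Project every alignment link onto the chosen taxonomic depth."""
        return {
            frozenset(self.at_depth(c, depth) for c in link)
            for link in self.links
        }


def toy_graph() -> OCG:
    """Build a two-domain example with hypothetical concepts."""
    g = OCG()
    g.parent = {
        "medicine": None,
        "cardiology": "medicine",
        "finance": None,
        "insurance": "finance",
        "health insurance": "insurance",
    }
    g.documents = {"d1": ("placeholder text", 0.9)}
    g.doc_concept = {"d1": "cardiology"}
    g.links = {frozenset({"cardiology", "health insurance"})}
    return g
```

Calling `toy_graph().links_at_depth(0)` collapses the single fine-grained link to a link between the two domain roots, while a depth of 2 leaves it at its original resolution.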
If this is right
- Quality-refined content improves data for LLM training stages.
- Hierarchical ontology enables systematic knowledge organization at scale.
- Cross-domain alignment supports synthesis of data across domains at flexible resolution.
- The resulting CortexBench benchmark tests search-and-reasoning across eight frontier LLMs.
- The full 24.14B-token corpus and OCG will be released publicly.
Where Pith is reading between the lines
- The approach could support more precise data selection for specific model capabilities like reasoning or domain expertise.
- It might reduce redundancy in training data by using the ontology to deduplicate at the concept level.
- Similar graph structures could be tested on non-web sources such as books or scientific papers to check broader applicability.
- If the alignment layer works at arbitrary resolution, it could enable fine-grained mixing of domains for targeted fine-tuning.
Load-bearing premise
The LLM-driven automated evolution produces a reliable hierarchical lightweight ontology that accurately captures and organizes the corpus content without substantial errors, biases, or need for human correction.
What would settle it
Manual review of the generated ontology hierarchy shows frequent mismatches with actual corpus content, or models trained on the Cortex corpus show no gains over flat corpora on cross-domain reasoning tasks.
Original abstract
The continuous evolution of large language models drives escalating demands on data scale and quality, and as different training stages impose increasingly tailored data requirements, systematic organization of high-quality corpora becomes indispensable. Existing corpus construction pipelines confine the resulting corpora to flat, undifferentiated document collections, universally lacking systematic knowledge organization. We present Cortex, to our knowledge the first framework that elevates web-scale corpus construction from flat document filtering to structured knowledge organization through an Ontological Corpus Graph (OCG), a three-layer heterogeneous structure unifying a quality-refined content layer, a hierarchical lightweight ontology layer via LLM-driven automated evolution, and a cross-domain alignment layer enabling inter-domain association at arbitrary taxonomic resolution. Comprehensive experiments confirm the effectiveness of Cortex. In particular, we leverage the OCG to synthesize CortexBench, a cross-domain search-and-reasoning benchmark whose evaluation across eight frontier LLMs validates the effectiveness of quality refinement, domain organization, and cross-domain data synthesis. We will publicly release the complete codebase, a 24.14B-token refined corpus with its OCG, and CortexBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Cortex, claimed as the first framework to move web-scale corpus construction from flat document filtering to structured knowledge organization via an Ontological Corpus Graph (OCG). The OCG is a three-layer heterogeneous structure: a quality-refined content layer, a hierarchical lightweight ontology layer built through LLM-driven automated evolution, and a cross-domain alignment layer for inter-domain associations at arbitrary taxonomic resolution. Experiments synthesize CortexBench (a cross-domain search-and-reasoning benchmark) and evaluate it across eight frontier LLMs to validate quality refinement, domain organization, and cross-domain synthesis; the authors plan to release the full codebase, a 24.14B-token refined corpus with its OCG, and CortexBench.
Significance. If the OCG construction and LLM-driven ontology evolution prove reliable, the work would advance corpus construction by enabling systematic knowledge organization at scale, potentially improving data quality for LLM training stages with tailored requirements and supporting cross-domain reasoning. The planned public release of code, corpus, OCG, and benchmark is a clear strength for reproducibility and community use.
major comments (1)
- [Abstract and Experiments section] Validation of the hierarchical lightweight ontology layer (the load-bearing middle layer of the OCG) occurs only indirectly, via downstream performance of eight LLMs on CortexBench. No direct accuracy metrics (e.g., precision/recall vs. human annotations, inter-annotator agreement, consistency checks, or error rates on the evolved ontology) are described, leaving the central assumption that LLM-driven automated evolution produces a reliable ontology untested at the source.
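As an illustration of the direct metrics named in this comment, the sketch below scores a sample of ontology edges against human annotations and measures agreement between two annotators. It is a hypothetical setup, not something described in the paper: the representation of the ontology as `(child, parent)` pairs, the function names, and the example labels are all assumptions.

```python
from collections import Counter


def edge_precision_recall(
    predicted: set[tuple[str, str]], gold: set[tuple[str, str]]
) -> tuple[float, float]:
    """Precision and recall of predicted (child, parent) edges against gold edges."""
    true_positive = len(predicted & gold)
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(gold) if gold else 0.0
    return precision, recall


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both annotators use one identical label;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sample: edges proposed by the LLM vs. edges accepted by annotators.
predicted_edges = {
    ("cardiology", "medicine"),
    ("insurance", "finance"),
    ("tax law", "medicine"),
}
gold_edges = {
    ("cardiology", "medicine"),
    ("insurance", "finance"),
    ("tax law", "finance"),
    ("oncology", "medicine"),
}
precision, recall = edge_precision_recall(predicted_edges, gold_edges)

# Two annotators judging the same four sampled edges.
kappa = cohen_kappa(["ok", "ok", "bad", "bad"], ["ok", "bad", "bad", "bad"])
```

Scoring only a random sample of edges keeps annotation cost independent of corpus size, at the price of wider uncertainty on the estimates.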
minor comments (1)
- [Abstract] The claim of being "to our knowledge the first" would benefit from a brief related-work contrast even at abstract length; the statement that experiments "confirm the effectiveness" lacks any mention of baselines or controls.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment below.
- Referee: [Abstract and Experiments section] Validation of the hierarchical lightweight ontology layer (the load-bearing middle layer of the OCG) occurs only indirectly, via downstream performance of eight LLMs on CortexBench. No direct accuracy metrics (e.g., precision/recall vs. human annotations, inter-annotator agreement, consistency checks, or error rates on the evolved ontology) are described, leaving the central assumption that LLM-driven automated evolution produces a reliable ontology untested at the source.
  Authors: We agree that validation of the hierarchical ontology layer is indirect. Direct metrics such as precision/recall against human annotations or inter-annotator agreement are not reported because creating reliable human ground truth for an LLM-evolved ontology over a 24.14B-token web-scale corpus is computationally and financially prohibitive. CortexBench was explicitly constructed to test the ontology's utility through cross-domain search and reasoning tasks; consistent performance gains across eight frontier LLMs provide evidence that the evolved structure supports the intended knowledge organization. We will add an explicit limitations paragraph discussing the indirect nature of this validation in the revised manuscript.
  Revision: partial
Circularity Check
No circularity: framework construction is self-contained engineering description
The paper presents Cortex as a methodological framework for building an Ontological Corpus Graph (OCG) via LLM-driven ontology evolution and cross-domain alignment, with validation performed through downstream experiments on CortexBench across eight LLMs. No equations, fitted parameters, or predictions are described that reduce any claim to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim of elevating corpus construction to structured organization is advanced as an original engineering contribution rather than derived from prior self-referential steps, making the derivation chain independent of the patterns that trigger circularity flags.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can perform reliable automated evolution of a hierarchical lightweight ontology from web-scale corpus content.
invented entities (1)
- Ontological Corpus Graph (OCG): no independent evidence