Recognition: 2 Lean theorem links
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Pith reviewed 2026-05-11 00:53 UTC · model grok-4.3
The pith
Safactory integrates parallel simulation, trustworthy data handling, and autonomous evolution into one closed-loop pipeline for training reliable agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence by tightly coupling a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation.
What carries the argument
The Safactory framework formed by tight integration of the Parallel Simulation Platform, Trustworthy Data Platform, and Autonomous Evolution Platform to create a single closed evolutionary loop.
Load-bearing premise
Tightly integrating the Parallel Simulation Platform, Trustworthy Data Platform, and Autonomous Evolution Platform will systematically discover risks and enable continuous closed-loop improvement of autonomous agents.
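The premise above is a control loop: simulate, store, extract, evolve, repeat. A minimal sketch of that loop follows; the paper publishes no code, so every class, field, and selection rule here is an illustrative assumption, not Safactory's actual API.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    # Illustrative record; the paper does not publish its schema.
    steps: list
    reward: float
    safety_violations: int

class SimulationPlatform:
    """Stands in for the Parallel Simulation Platform (serial stub)."""
    def generate(self, policy, n):
        return [policy(i) for i in range(n)]

class DataPlatform:
    """Stands in for the Trustworthy Data Platform."""
    def __init__(self):
        self.store = []
    def ingest(self, trajectories):
        self.store.extend(trajectories)
    def extract_experience(self):
        # Assumed rule: keep trajectories that expose risks or carry strong reward.
        return [t for t in self.store if t.safety_violations > 0 or t.reward > 0.8]

class EvolutionPlatform:
    """Stands in for the Autonomous Evolution Platform (identity stub)."""
    def update(self, policy, experience):
        return policy

def closed_loop(policy, sim, data, evo, cycles=2, n=4):
    for _ in range(cycles):
        trajectories = sim.generate(policy, n)   # 1. simulate
        data.ingest(trajectories)                # 2. store
        experience = data.extract_experience()   # 3. extract experience
        policy = evo.update(policy, experience)  # 4. evolve
    return policy
```

The point of the sketch is only the topology: each platform's output is the next platform's input, and the evolved policy feeds back into simulation.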
What would settle it
A controlled comparison against separate, non-integrated simulation, data, and training systems on the same long-horizon agent tasks: the claim fails if the integrated pipeline identifies no additional risks and produces no measurable performance gains.
read the original abstract
As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agenticinfrastructure remain fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present \textbf{Safactory}, a scalable agent factory for trustworthy autonomous intelligence. Safactory integrates three tightly coupled platforms: a \textbf{Parallel Simulation Platform} for trajectory generation, a \textbf{Trustworthy Data Platform} for trajectory storage and experience extraction, and an \textbf{Autonomous Evolution Platform} for asynchronous reinforcement learning and on-policy distillation. As far as we know, Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce Safactory, a scalable agentic infrastructure that unifies three platforms—the Parallel Simulation Platform for generating trajectories, the Trustworthy Data Platform for storing and extracting experiences, and the Autonomous Evolution Platform for asynchronous RL and distillation—into a closed-loop system for training trustworthy autonomous agents. It positions this as the first such unified evolutionary pipeline to address fragmentation in agent evaluation, data management, and evolution.
Significance. Should the proposed integration prove effective, it could have substantial significance for the AI community by providing a framework for continuous improvement and risk mitigation in autonomous agents, which is a growing area of concern. The emphasis on trustworthiness and scalability addresses timely challenges in deploying agents in real environments. However, the current manuscript does not provide evidence to substantiate these benefits.
major comments (2)
- [Abstract] The central claim that the tight integration of the three platforms enables 'systematic' risk discovery and 'continuous closed-loop improvement' is not supported by any description of the specific mechanisms, data schemas, feedback loops, or risk metrics involved. This absence makes the primary contribution difficult to assess or reproduce.
- [Abstract] No experiments, benchmarks, ablations, or even toy examples are presented to demonstrate the framework's scalability or effectiveness in improving agent trustworthiness over existing fragmented approaches.
minor comments (2)
- [Abstract] Typo: 'agenticinfrastructure' should be 'agentic infrastructure'.
- [Abstract] Grammatical issue: 'Existing agenticinfrastructure remain fragmented' should use 'remains' since 'infrastructure' is treated as singular.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for recognizing the potential significance of Safactory in addressing fragmentation in agent training infrastructure. We address the major comments point by point below. Where the comments identify gaps in the original submission, we have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] The central claim that the tight integration of the three platforms enables 'systematic' risk discovery and 'continuous closed-loop improvement' is not supported by any description of the specific mechanisms, data schemas, feedback loops, or risk metrics involved. This absence makes the primary contribution difficult to assess or reproduce.
Authors: We agree that the abstract is high-level and does not enumerate these details. The body of the manuscript describes the platforms and their coupling, but we acknowledge the need for greater specificity to support the claims. In the revised version, we have expanded the abstract with a brief reference to the mechanisms and added a dedicated paragraph in Section 2 that specifies the data schemas (trajectory records with embedded safety annotations), feedback loops (experience extraction triggering asynchronous RL updates), and risk metrics (e.g., safety-violation frequency and long-horizon reward with penalty terms). A new diagram has also been included to illustrate the closed loop. revision: yes
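One way to read the revision's claimed specifics: a per-step safety annotation embedded in the trajectory record, plus the two named risk metrics. The sketch below is an assumed rendering for illustration; none of these field or function names come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    observation: str
    safety_flags: tuple = ()  # embedded safety annotations, e.g. ("unsafe_shell",)

@dataclass
class TrajectoryRecord:
    steps: list
    final_reward: float

def safety_violation_frequency(records):
    """Fraction of steps carrying at least one safety flag."""
    steps = [s for r in records for s in r.steps]
    if not steps:
        return 0.0
    return sum(1 for s in steps if s.safety_flags) / len(steps)

def penalized_return(record, penalty=1.0):
    """Long-horizon reward minus a per-violation penalty term."""
    violations = sum(len(s.safety_flags) for s in record.steps)
    return record.final_reward - penalty * violations
```

Under this reading, "experience extraction triggering asynchronous RL updates" means the data platform filters records on these metrics and hands the survivors to the trainer.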
-
Referee: [Abstract] No experiments, benchmarks, ablations, or even toy examples are presented to demonstrate the framework's scalability or effectiveness in improving agent trustworthiness over existing fragmented approaches.
Authors: This observation is correct; the original manuscript is a system-description paper and contains no empirical results. To address the concern, the revised manuscript now includes a new 'Preliminary Evaluation' section with two toy examples (a grid-world navigation task and a simple tool-use scenario). These demonstrate closed-loop improvement via reduced safety violations after one evolution cycle when using the integrated pipeline versus running the platforms independently. We also report basic scalability metrics for the Parallel Simulation Platform (trajectory throughput scaling linearly with worker count up to 128 cores). Comprehensive benchmarks on large models remain future work, as the infrastructure is still maturing. revision: yes
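The rebuttal's grid-world toy is not specified in detail. A minimal stand-in that shows the claimed before/after comparison: a 3x3 grid with one hazard cell, a random baseline policy, and an "evolved" policy that masks moves into the hazard (the environment, both policies, and all names here are assumptions, not the authors' setup).

```python
import random

HAZARD = (1, 1)
GOAL = (2, 2)

def run_episode(policy, seed):
    """Walk a 3x3 grid from (0,0); count entries into the hazard cell."""
    rng = random.Random(seed)
    pos, violations = (0, 0), 0
    for _ in range(12):
        dx, dy = policy(pos, rng)
        pos = (min(2, max(0, pos[0] + dx)), min(2, max(0, pos[1] + dy)))
        if pos == HAZARD:
            violations += 1
        if pos == GOAL:
            break
    return violations

def random_policy(pos, rng):
    return rng.choice([(1, 0), (0, 1), (-1, 0), (0, -1)])

def evolved_policy(pos, rng):
    # "One evolution cycle": mask any move that would land on the hazard.
    moves = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    safe = [m for m in moves
            if (min(2, max(0, pos[0] + m[0])),
                min(2, max(0, pos[1] + m[1]))) != HAZARD]
    return rng.choice(safe)

def violation_rate(policy, episodes=200):
    return sum(run_episode(policy, s) for s in range(episodes)) / episodes
```

Comparing `violation_rate(random_policy)` against `violation_rate(evolved_policy)` reproduces, in miniature, the "reduced safety violations after one evolution cycle" measurement the revision promises.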
Circularity Check
No circularity: purely architectural description with no derivations or self-referential reductions
full rationale
The paper presents Safactory as an integration of three named platforms (Parallel Simulation for trajectories, Trustworthy Data for storage/extraction, Autonomous Evolution for async RL and distillation) and asserts it is the first unified evolutionary pipeline. No equations, fitted parameters, predictions, or derivation steps appear in the provided text. The central claim is a descriptive architecture plus a novelty assertion; it does not define any quantity in terms of itself, rename a fitted result as a prediction, or rely on self-citations for load-bearing uniqueness. The description is self-contained as an engineering proposal and contains no mathematical chain that could reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · match: unclear · matched text: "Safactory integrates three tightly coupled platforms: a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation."
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match: unclear · matched text: "the closed loop of discovering risks through execution, consolidating evidence through data, completing repairs through evolution"
Reference graph
Works this paper leans on
-
[1] Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766, 2026.
-
[2] Anthropic. Introducing agent skills. https://claude.com/blog/skills, 2025.
-
[3] Apache. Apache Airflow: A platform to programmatically author, schedule, and monitor workflows. https://github.com/apache/airflow, 2024.
-
[4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
-
[5] Yuntao Bai, Andy Jones, Kamile Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Deep Ganguli, Tom Henighan, Nicholas Joseph, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
-
[6] Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, 2019.
-
[7] Elisa Bassignana, Valerio Basile, and Viviana Patti. Hurtlex: A multilingual lexicon of words to hurt. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 52–57, 2018.
-
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
-
[9] Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, et al. Opendataarena: A fair and open arena for benchmarking post-training dataset value. arXiv preprint arXiv:2512.14051, 2025.
-
[10] Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, and Lijun Wu. Opendataarena: A fair and open arena for benchmarking post-training dataset value, 2025.
-
[11] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language ..., 2024.
-
[12] Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, and Yingchun Wang. Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments? arXiv preprint arXiv:2510.20333, 2025.
-
[13] Daoyuan Chen, Yilun Huang, Zhijian Ma, et al. Data-Juicer: A one-stop data processing system for large language models. In Proceedings of the 2024 ACM SIGMOD International Conference on Management of Data, 2024.
-
[14] Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. ELEPHANT: Measuring and understanding social sycophancy in LLMs. arXiv preprint arXiv:2505.13995, 2025.
-
[15] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
-
[16] Peter J A Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andreas Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, 2009.
-
[17] Confident AI. DeepEval: The LLM evaluation framework, 2024.
-
[18] Dingo Contributors. Dingo: A comprehensive AI data quality evaluation tool for large models. https://github.com/MigoXLab/dingo, 2024.
-
[19] Dagster. Dagster: An orchestration platform for the development, production, and observation of data assets. https://github.com/dagster-io/dagster, 2024.
-
[20] Enric Junqué de Fortuny. Bias detection with modernbert-large. 2025.
-
[21] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
-
[22] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
-
[23] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
-
[24] DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
-
[25] Deepset. prompt-injections. https://huggingface.co/datasets/deepset/prompt-injections, 2020.
-
[26] Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. garak: A Framework for Security Probing Large Language Models. 2024.
-
[27] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators, 2025.
-
[28] Izik Eidus and Hugh Dickins. Kernel samepage merging. https://docs.kernel.org/admin-guide/mm/ksm.html, 2009. Accessed: 2026.
-
[29] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025.
-
[30] Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 2021.
-
[31] Giskard AI. Giskard Hub, 2024.
-
[32] GLM-4.5 Team. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471, 2025.
-
[33] Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A Engemann, Daniel Strohmeier, Christian Brodbeck, Lauri Parkkonen, and Matti S Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014.
-
[34] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems, 37:8093–8131, 2024.
-
[35] Laura Hanu and Unitary team. Detoxify. https://github.com/unitaryai/detoxify, 2020.
-
[36] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3309–3326, 2022.
-
[37] Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543, 2021.
-
[38] Hugging Face. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2025.
-
[39] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
-
[40] inclusionAI. Areal: Lightning-fast RL for LLM reasoning and agents. https://github.com/inclusionAI/AReaL, n.d.
-
[41] Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977.
-
[42] Yang JingYi, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
-
[43] Minoru Kanehisa, Yoko Sato, Masayuki Kawashima, Miho Furumichi, and Mao Tanabe. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research, 44(D1):D457–D462, 2016.
-
[44] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Research, 49(D1):D1388–D1395, 2021.
-
[45] Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
-
[46] Greg Landrum et al. RDKit: Open-source cheminformatics. http://www.rdkit.org.
-
[47] Langfuse. Langfuse: Open source LLM engineering platform, 2024.
-
[48] Hao Li, Xiaogeng Liu, Ning Zhang, and Chaowei Xiao. Piguard: Prompt injection guardrail via mitigating overdefense for free. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 30420–30437, 2025.
-
[49] Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech..., 2024.
-
[50] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, ... SkillsBench: Benchmarking how well agent skills work across diverse tasks, 2026.
-
[51] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022.
-
[52] Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023.
-
[53] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, and Weizhu Chen. Arena learning: Build data flywheel for llms post-training via simulated chatbot arena. arXiv preprint arXiv:2407.10627, 2024.
-
[54] Media Bias Group. BABE. https://huggingface.co/datasets/mediabiasgroup/BABE, 2020.
-
[55] Mike A. Merrill, Alex Shaw, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. 2026.
-
[56] Microsoft. Presidio. https://github.com/microsoft/presidio, 2020.
-
[57] MLflow. MLflow: A machine learning lifecycle platform, 2024.
-
[58] Moonshot AI. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
-
[59] NVIDIA. NVIDIA NeMo Curator. https://github.com/NVIDIA-NeMo/Curator, 2024.
-
[60] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
-
[61] OpenAI. OpenAI Evals, 2023.
-
[62] OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models, 2023.
-
[63] OpenRLHF Team. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024.
-
[64] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
-
[65] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.
-
[66] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics, pages 13387–13434, 2023.
-
[67] PinchBench Team. Pinchbench skill: Benchmark runner and task definitions for openclaw agents. https://github.com/pinchbench/skill, 2026. GitHub repository.
-
[68] Prefect. Prefect: The new standard in dataflow automation. https://github.com/PrefectHQ/prefect, 2024.
-
[69] promptfoo. promptfoo: Test and evaluate LLMs, 2024.
-
[70] Qwen Team. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
-
[71] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
-
[72] Jiyong Rao et al. SciDataCopilot: An agentic data preparation framework for AGI-driven scientific discovery. arXiv preprint arXiv:2602.09132, 2026.
-
[73] RollArt Team. Rollart: Scaling agentic RL training via disaggregated infrastructure. arXiv preprint arXiv:2508.03680, 2025.
-
[74] Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms. arXiv preprint arXiv:2402.09668, 2024.
-
[75] Shanghai AI Laboratory. DeepLink. https://github.com/DeepLink-org, 2023.
-
[76] Shanghai AI Laboratory. Deeplink: Artificial intelligence open computing system. https://deeplink.org.cn/home, 2023.
-
[77] Alex Shaw, Mike A. Merrill, et al. Harbor: A framework for running agent evaluations and creating RL environments, 2025.
-
[78] Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, and Junxian He. Predictive data selection: The data that predicts is the data that teaches. arXiv preprint arXiv:2503.00808, 2025.
-
[79] Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Scaling agents via continual pre-training. arXiv preprint arXiv:2509.13310, 2025.
-
[80] Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, et al. Multipriv: Benchmarking individual-level privacy reasoning in vision-language models. arXiv preprint arXiv:2511.16940, 2025.