pith. machine review for the scientific record.

arxiv: 2406.20094 · v3 · submitted 2024-06-28 · 💻 cs.CL · cs.LG

Recognition: 3 theorem links · Lean Theorem

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 00:00 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords synthetic data · personas · large language models · data synthesis · scaling · web curation · reasoning problems · instruction generation

The pith

A hub of one billion web-curated personas lets an LLM generate diverse synthetic data across math, instructions, knowledge texts, NPCs, and tools at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a persona-driven synthesis method that uses a massive collection of personas to draw out varied perspectives already latent inside a large language model. These personas, pulled automatically from web data, act as stand-ins for different slices of human experience and knowledge. When the model is prompted to respond from each persona's viewpoint, it produces synthetic examples that cover a wider range of topics and styles than conventional generation techniques. The authors demonstrate the approach on mathematical reasoning problems, user instructions, factual texts, game characters, and callable tools, showing that the same machinery works across domains without task-specific engineering.

Core claim

Persona Hub is a set of one billion diverse personas automatically extracted from the web; when used to condition an LLM, they function as distributed carriers of world knowledge that collectively surface almost every perspective the model has internalized, allowing high-volume creation of synthetic data for any scenario the authors test.

What carries the argument

Persona Hub: a collection of one billion personas automatically curated from web data that serve as role-play prompts to elicit different viewpoints from the same underlying LLM.
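The mechanism can be sketched as a minimal persona-conditioned prompt builder; the template wording, the example personas, and the `build_prompt` helper below are illustrative placeholders, not the paper's actual prompt formats.

```python
# Minimal sketch of persona-conditioned generation. The template wording,
# the example personas, and the `build_prompt` helper are illustrative
# placeholders, not the paper's actual prompts.
def build_prompt(persona: str, task: str) -> str:
    """Condition a single generation request on one persona description."""
    return (
        f"You are {persona}.\n"
        f"{task}\n"
        "Answer entirely from this persona's perspective."
    )

personas = [
    "a freight logistics coordinator at a small trucking firm",
    "a retired astrophysicist who volunteers at a planetarium",
]
task = "Write one challenging math word problem grounded in your daily work."

# One prompt per persona; each would be sent to the same underlying LLM.
prompts = [build_prompt(p, task) for p in personas]
```

Scaling this to a billion personas is then a matter of iterating the same template over the full hub, which is why the paper can switch domains without redesigning the pipeline.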

If this is right

  • Mathematical and logical reasoning problems can be created in bulk by having each persona pose or solve questions from its own background.
  • Instruction-tuning datasets become larger and more varied because each persona generates user-style prompts reflecting its own needs and language.
  • Knowledge-rich documents, game non-player characters, and executable tools can be synthesized on demand without writing separate prompts for each domain.
  • The same persona set works for many different data-generation tasks, removing the need to redesign pipelines when moving between applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the personas really capture broad human perspectives, the resulting data could reduce reliance on human annotators for alignment and capability training.
  • The method might extend to multimodal generation if personas are used to describe images, videos, or code that the model then produces.
  • A practical limit may appear once the number of unique personas exceeds the model's ability to distinguish them without collapse into generic outputs.

Load-bearing premise

Automatically collected web personas are diverse enough, unbiased enough, and faithfully simulable by the LLM that they produce new data rather than repetitive or hallucinated outputs.

What would settle it

Run the same generation tasks with Persona Hub versus a much smaller set of hand-written personas or unconditioned prompting, then measure output diversity and downstream task performance; if the billion-persona version shows no measurable gain in variety or quality, the scaling claim does not hold.
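A minimal version of that diversity measurement might look like the following; the bag-of-words `embed` is a stand-in for a real sentence-embedding model, and the sample outputs are invented purely for illustration.

```python
# Sketch of the proposed comparison: score output diversity under
# persona-conditioned vs. unconditioned prompting. A real experiment would
# use a sentence-embedding model; bag-of-words vectors stand in here.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: word-count vector (stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def diversity(outputs: list[str]) -> float:
    """Mean pairwise cosine distance: higher means more varied outputs."""
    pairs = [(i, j) for i in range(len(outputs)) for j in range(i + 1, len(outputs))]
    return sum(1 - cosine(embed(outputs[i]), embed(outputs[j])) for i, j in pairs) / len(pairs)

# Invented samples: varied persona-driven outputs vs. repetitive generic ones.
persona_outputs = ["a problem about truck fuel budgets", "a problem about stellar parallax"]
generic_outputs = ["a generic math problem", "a generic math problem"]
```

If the billion-persona condition fails to beat the small-persona or unconditioned baselines on a score like this (and on downstream task performance), the scaling claim does not hold.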

read the original abstract

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Persona Hub, a collection of 1 billion personas automatically curated from web data, and proposes a persona-driven methodology to leverage LLMs for generating diverse synthetic data at scale. It demonstrates this approach through use cases including synthesis of mathematical and logical reasoning problems, user instructions, knowledge-rich texts, game NPCs, and tool functions, claiming that the personas act as distributed carriers of world knowledge to tap into nearly every perspective within the LLM.

Significance. If the central claim holds and the 1B personas provide broad, low-repetition coverage of perspectives without introducing substantial bias or hallucination, the work could offer a scalable and flexible framework for synthetic data creation that reduces reliance on human annotation and improves diversity in LLM training data across domains such as reasoning and instruction following.

major comments (3)
  1. [Abstract and §4 (use cases)] Abstract and use-case sections: The paper asserts successful application to mathematical reasoning, instructions, and other tasks but provides no quantitative metrics (e.g., accuracy, diversity scores, or human preference ratings), ablation studies, or error analysis to support the quality of the generated data or the effectiveness of the persona simulation.
  2. [§3 (Persona Hub construction)] Persona curation methodology: The automatic web-based curation process lacks any described mechanism or metric for enforcing global demographic balance (language, geography, age, occupation) or deduplication; without such controls, the claim that the personas tap 'almost every perspective' risks being undermined by known web-data skews toward English-speaking and digitally active populations.
  3. [§3 and §4] Diversity and fidelity evaluation: No comparison is presented between the persona distribution and real-world census or survey benchmarks, nor any analysis of repetition rates or hallucinated perspectives in the LLM-simulated outputs, which are load-bearing for the 'distributed carriers of world knowledge' premise.
minor comments (2)
  1. [Abstract] The abstract and introduction could more precisely define 'diverse' and 'almost every perspective' with reference to measurable criteria rather than qualitative assertion.
  2. [Figures and tables in §4] Figure captions and table descriptions would benefit from explicit statements of sample sizes and evaluation protocols used in the use-case demonstrations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4 (use cases)] Abstract and use-case sections: The paper asserts successful application to mathematical reasoning, instructions, and other tasks but provides no quantitative metrics (e.g., accuracy, diversity scores, or human preference ratings), ablation studies, or error analysis to support the quality of the generated data or the effectiveness of the persona simulation.

    Authors: We agree that the use-case demonstrations would be strengthened by quantitative support. In the revised manuscript we have added accuracy metrics for the mathematical and logical reasoning tasks (measured against reference solutions), embedding-based diversity scores across generated outputs, and human preference ratings collected on a sampled subset of the data. Ablation studies on the effect of persona count are now included in the appendix, together with a dedicated error analysis subsection in §4. revision: yes

  2. Referee: [§3 (Persona Hub construction)] Persona curation methodology: The automatic web-based curation process lacks any described mechanism or metric for enforcing global demographic balance (language, geography, age, occupation) or deduplication; without such controls, the claim that the personas tap 'almost every perspective' risks being undermined by known web-data skews toward English-speaking and digitally active populations.

    Authors: The curation pipeline in §3 is intentionally automatic and web-driven to reach one-billion scale. Explicit global demographic quotas were not imposed because defining and enforcing balanced targets across all attributes at this scale is methodologically and computationally challenging. However, we did apply embedding-based deduplication with a cosine-similarity threshold. We have expanded §3 to report language and geographic distributions observed in the final set and have added an explicit limitations paragraph acknowledging web-induced skews toward digitally active populations. revision: partial

  3. Referee: [§3 and §4] Diversity and fidelity evaluation: No comparison is presented between the persona distribution and real-world census or survey benchmarks, nor any analysis of repetition rates or hallucinated perspectives in the LLM-simulated outputs, which are load-bearing for the 'distributed carriers of world knowledge' premise.

    Authors: We acknowledge the value of external validation. The revised manuscript now includes a new subsection comparing inferred persona attributes (occupation, location, age proxies) against publicly available demographic aggregates where direct alignment is possible. Repetition is quantified via the deduplication statistics already computed during construction. A manual review of hallucinated or low-fidelity perspectives on a random sample of generated outputs has been added to §4, with the results reported. revision: yes
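The embedding-based deduplication described in the response to comment 2 can be sketched as a greedy filter; the vectors and the 0.9 threshold here are illustrative, since the paper's embedding model and chosen threshold are not reproduced in this review.

```python
# Sketch of greedy cosine-similarity deduplication, as described in the
# authors' response to comment 2. The vectors and the 0.9 threshold are
# illustrative assumptions, not the paper's actual values.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedup(vectors, threshold=0.9):
    """Keep a vector only if it stays below `threshold` similarity to every kept one."""
    kept = []
    for v in vectors:
        if all(cosine(v, k) < threshold for k in kept):
            kept.append(v)
    return kept

# The first two vectors are near-duplicates; only one of them survives.
vecs = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
```

At billion scale a pairwise greedy pass is infeasible, so a real pipeline would back this with approximate nearest-neighbor search; the per-candidate decision rule is the same.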

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper proposes curating 1 billion personas from external web data to drive LLM-based synthetic data synthesis across scenarios like math problems and instructions. This methodology depends on web curation processes and LLM prompting rather than any self-referential equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a load-bearing way. The demonstrations are presented as empirical use cases, so the central claims remain testable against external benchmarks rather than against the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that web-derived personas provide broad, independent coverage of world knowledge that LLMs can faithfully role-play; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Large language models can accurately simulate a wide range of human-like personas extracted from web text without systematic bias or loss of diversity.
    Invoked when claiming the 1B personas tap 'almost every perspective' inside the LLM.
invented entities (1)
  • Persona Hub no independent evidence
    purpose: A curated collection of 1 billion personas serving as carriers of diverse world knowledge for data synthesis.
    Newly introduced dataset whose diversity and coverage are asserted but not independently verified outside the paper.

pith-pipeline@v0.9.0 · 5468 in / 1249 out tokens · 44444 ms · 2026-05-16T00:00:27.724368+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LawOfExistence unity_unique_existent · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios."

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    PPol uses LLM-driven evolutionary program search to create diverse human-like user personas for simulators, yielding 33-62% fitness gains and +17% agent task success on retail and airline domains.

  3. DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

    cs.CL 2026-05 unverdicted novelty 7.0

    DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.

  4. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

  5. Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.

  6. C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

    cs.CL 2026-04 unverdicted novelty 7.0

    C-Mining automatically mines high-fidelity Culture Points from raw multilingual text by treating cross-lingual geometric isolation in embeddings as a quantifiable signal for cultural specificity, then uses them to syn...

  7. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

  8. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  9. SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams

    cs.CL 2026-03 unverdicted novelty 7.0

    SensorPersona uses LLMs for hierarchical reasoning on longitudinal mobile sensor streams to continually extract stable personas, showing up to 31.4% higher recall and 85.7% win rate over baselines on a 20-user dataset.

  10. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.

  11. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy replaces privacy-sensitive spans with structured placeholders on edge devices to enable effective cloud memory management while limiting utility loss to 1.6% and outperforming general models on privacy extraction.

  12. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy uses edge detection of sensitive spans and type-aware placeholders to enable cloud-side memory management for LLM agents without exposing private data, achieving under 1.6% utility loss.

  13. CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    cs.AI 2026-04 unverdicted novelty 6.0

    CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

  14. Opal: Private Memory for Personal AI

    cs.CR 2026-04 unverdicted novelty 6.0

    Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.

  15. Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

    cs.CV 2026-03 conditional novelty 6.0

    PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.

  16. PersonaVLM: Long-Term Personalized Multimodal LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.

  17. Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

    cs.CL 2026-05 unverdicted novelty 5.0

    Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.

  18. UserGPT Technical Report

    cs.IR 2026-05 unverdicted novelty 5.0

    UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while...

  19. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  20. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    cs.CL 2025-06 unverdicted novelty 4.0

    Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning,...

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 18 Pith papers · 11 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  3. [3]

    Coig-cqia: Quality is all you need for chinese instruction fine-tuning

    Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, et al. Coig-cqia: Quality is all you need for chinese instruction fine-tuning. arXiv preprint arXiv:2403.18058,

  4. [4]

    Comprehensive exploration of synthetic data generation: A survey

    André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, and Ian Foster. Comprehensive exploration of synthetic data generation: A survey. arXiv preprint arXiv:2401.02524,

  5. [5]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

  6. [6]

    On the resemblance and containment of documents

    Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. IEEE,

  7. [7]

    Large language models as tool makers

    Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint arXiv:2305.17126,

  8. [8]

    On the possibilities of ai-generated text detection

    Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. On the possibilities of ai-generated text detection. arXiv preprint arXiv:2304.04736,

  9. [9]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588,

  10. [10]

    Language Modeling is Compression

    Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668,

  11. [11]

    A tale of tails: Model collapse as a change of scaling laws

    Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, and Julia Kempe. A tale of tails: Model collapse as a change of scaling laws. arXiv preprint arXiv:2402.07043,

  12. [12]

    Strategic reasoning with language models

    Kanishk Gandhi, Dorsa Sadigh, and Noah D Goodman. Strategic reasoning with language models. arXiv preprint arXiv:2305.19165,

  13. [13]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. URL https://openreview.net/forum?id=uREj4ZuGJE.

  14. [14]

    Key-point-driven data synthesis with its enhancement on mathematical reasoning

    Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, and Weizhu Chen. Key-point-driven data synthesis with its enhancement on mathematical reasoning. arXiv preprint arXiv:2403.02333,

  15. [15]

    Faithful persona-based conversational dataset generation with large language models

    Pegah Jandaghi, XiangHai Sheng, Xinyi Bai, Jay Pujara, and Hakim Sidahmed. Faithful persona-based conversational dataset generation with large language models. arXiv preprint arXiv:2312.10007,

  16. [16]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  17. [17]

    Common 7b language models already possess strong math capabilities

    Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, and Houwen Peng. Common 7b language models already possess strong math capabilities. arXiv preprint arXiv:2403.04706, 2024a. Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, e...

  18. [18]

    Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization

    Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170,

  19. [19]

    Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

    Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. arXiv preprint arXiv:2401.16380,

  20. [20]

    On the risk of misinformation pollution with large language models

    Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661,

  21. [21]

    The curse of recursion: Training on generated data makes models forget

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493,

  22. [22]

    MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

    Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. arXiv preprint arXiv:2310.03731,

  23. [23]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560,

  24. [24]

    Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration

    Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vo...

  25. [25]

    Hallucination is Inevitable: An Innate Limitation of Large Language Models

    Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817,

  26. [26]

    Yi: Open Foundation Models by 01.AI

    01.AI. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652,

  27. [27]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284,

  28. [28]

    Llm as a mastermind: A survey of strategic reasoning with large language models

    Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. Llm as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230,

  29. [29]

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931. URL https://openreview.net/forum?id=Bl8u7ZRlbM.