pith. machine review for the scientific record.

arxiv: 2309.05463 · v1 · submitted 2023-09-11 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

Textbooks Are All You Need II: phi-1.5 technical report

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords phi-1.5 · synthetic data · small language models · reasoning tasks · textbook quality data · transformer models · grade-school math · coding performance

The pith

A 1.3 billion parameter model trained on synthetic textbook-style data performs comparably to models five times its size on natural language tasks and beats most non-frontier LLMs on grade-school math and basic coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that high-quality synthetic data can let a small Transformer language model reach strong results on natural language understanding and reasoning benchmarks. Using larger models to generate textbook-style content, the authors train phi-1.5 to perform comparably to much bigger systems on grade-school math and basic coding while displaying step-by-step thinking and basic in-context learning. The clean data source also reduces toxic or biased outputs compared with web-trained models. This work extends the earlier phi-1 coding model by applying the same data-quality focus to broader common-sense reasoning.

Core claim

phi-1.5 is a 1.3 billion parameter Transformer trained primarily on synthetic textbook-quality data generated by existing large language models. It reaches performance on natural language tasks comparable to models five times its size and surpasses most non-frontier LLMs on grade-school mathematics and basic coding. The model exhibits traits of much larger systems such as thinking step by step and rudimentary in-context learning, while showing fewer toxic generations because web data was avoided.
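For scale, a back-of-envelope count shows how a decoder-only Transformer in this range reaches roughly 1.3 billion parameters. The layer count, width, and vocabulary size below are illustrative assumptions chosen to land near that figure, not values taken from the abstract.

```python
# Rough parameter count for a decoder-only Transformer (illustrative sketch;
# not necessarily phi-1.5's exact configuration). Each layer contributes
# about 12 * d_model^2 parameters (attention projections plus a 4x MLP),
# and the token embedding adds vocab_size * d_model.
def transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    per_layer = 12 * d_model ** 2       # 4*d^2 attention + 8*d^2 feed-forward
    embedding = vocab_size * d_model    # input embedding (output weights tied)
    return n_layers * per_layer + embedding

# Assumed configuration: 24 layers, width 2048, ~51k-token vocabulary.
print(f"{transformer_params(24, 2048, 51_200):,}")  # 1,312,817,152, about 1.3B
```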

What carries the argument

Synthetic textbook-quality data generated by larger LLMs, used in place of web-scraped text to train the smaller model.
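As one concrete picture of that ingredient, here is a minimal sketch of textbook-style synthetic data generation, assuming a stronger teacher model reachable through a placeholder complete() call. The prompt template, topic seeds, and output format are hypothetical, not the authors' actual pipeline.

```python
# Minimal sketch of generating "textbook quality" training text with a larger
# teacher model. complete() is a placeholder for whatever API serves the
# teacher; the prompt wording and topics are illustrative assumptions.
import json
import random

TOPICS = ["fractions and ratios", "conditional reasoning", "basic Python loops"]

PROMPT = (
    "Write a short textbook section on {topic} for a careful beginner. "
    "Explain the idea step by step, then give two worked exercises with solutions."
)

def complete(prompt: str) -> str:
    """Placeholder for a call to the larger teacher model (hypothetical)."""
    raise NotImplementedError

def make_corpus(n_docs: int, path: str) -> None:
    """Write n_docs synthetic textbook passages as JSON lines."""
    with open(path, "w") as f:
        for _ in range(n_docs):
            topic = random.choice(TOPICS)  # vary topics to keep the data diverse
            text = complete(PROMPT.format(topic=topic))
            f.write(json.dumps({"topic": topic, "text": text}) + "\n")
```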

If this is right

  • Smaller models can display chain-of-thought reasoning and in-context learning when trained on clean synthetic data.
  • Avoiding web data reduces the rate of toxic or biased outputs.
  • Performance on math and coding tasks can improve through data quality rather than parameter count alone.
  • Open-sourcing the model enables community checks on whether the observed abilities generalize beyond the training distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data curation methods may prove more efficient than raw scaling for building capable models on targeted reasoning tasks.
  • The same synthetic-data approach could be tested on domains outside language, such as basic science or logic puzzles.
  • Future evaluations should include out-of-distribution problems to check whether the model truly reasons or has memorized benchmark styles.
  • Open release allows independent tests of whether the reduced toxicity persists when the model is fine-tuned on new data.

Load-bearing premise

Standard grade-school math and coding benchmarks measure genuine reasoning rather than the model simply matching patterns present in the synthetic training data.

What would settle it

A new collection of math and coding problems written to avoid stylistic patterns from the synthetic textbooks; if phi-1.5 performance drops sharply on this set while larger models do not, the central claim is weakened.
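A minimal sketch of that comparison, assuming the fresh problems are stored as JSON lines with question and gold fields and that answer() wraps however each model is queried; the model identifiers and file name are illustrative.

```python
# Compare accuracy on a freshly written, style-controlled problem set across
# models and look for a disproportionate drop for phi-1.5. The file name,
# model identifiers, and answer() helper are assumptions for illustration.
import json

def answer(model: str, question: str) -> str:
    """Placeholder: query `model` and return its final answer string."""
    raise NotImplementedError

def accuracy(model: str, problems: list[dict]) -> float:
    correct = sum(
        answer(model, p["question"]).strip() == p["gold"].strip()
        for p in problems
    )
    return correct / len(problems)

with open("fresh_ood_math.jsonl") as f:  # hypothetical new out-of-distribution set
    problems = [json.loads(line) for line in f]

for model in ["phi-1.5", "llama-7b", "falcon-7b"]:
    print(model, round(accuracy(model, problems), 3))

# If phi-1.5 drops sharply here while the larger baselines hold steady, the
# "genuine reasoning" reading of its benchmark results is weakened.
```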

read the original abstract

We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories -- a 10 million parameter model that can produce coherent English -- and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate "textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the "Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good -- such as the ability to "think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations -- encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces phi-1.5, a 1.3B-parameter Transformer model trained primarily on synthetic 'textbook-quality' data generated by larger LLMs. Building on prior work with TinyStories and phi-1, it focuses on common-sense reasoning in natural language and reports that phi-1.5 achieves performance on natural language tasks comparable to models 5x larger while surpassing most non-frontier LLMs on grade-school mathematics and basic coding. The model is shown to exhibit step-by-step reasoning, rudimentary in-context learning, hallucinations, and reduced toxicity due to the absence of web data; the model weights are open-sourced.

Significance. If the empirical claims hold after proper validation, the result would demonstrate that high-quality synthetic data can substantially narrow the performance gap between small and large models on reasoning tasks, offering a data-centric alternative to pure scaling and potentially reducing reliance on noisy web corpora.

major comments (2)
  1. [Abstract] The central claim that phi-1.5 matches models 5x larger on natural language tasks and surpasses most non-frontier LLMs on grade-school math and coding is stated without any benchmark scores, model-size comparisons, baselines, or statistical details; this absence makes the headline result impossible to evaluate from the provided text.
  2. [Abstract] The training corpus is generated by larger LLMs; no decontamination statistics, n-gram overlap analysis, or ablation isolating synthetic versus web data on held-out problems (e.g., GSM8K) are supplied, leaving open the possibility that reported gains reflect distributional overlap rather than genuine reasoning improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing that the abstract can be strengthened with concrete metrics and that additional data-quality analyses would improve transparency. We will incorporate these changes in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that phi-1.5 matches models 5x larger on natural language tasks and surpasses most non-frontier LLMs on grade-school math and coding is stated without any benchmark scores, model-size comparisons, baselines, or statistical details; this absence makes the headline result impossible to evaluate from the provided text.

    Authors: We agree that the abstract would benefit from explicit benchmark numbers to make the central claims immediately evaluable. The full paper already contains the supporting evidence in Section 4 and Tables 1–3 (e.g., phi-1.5 at 1.3B matches or exceeds several 7B models on ARC, BoolQ, and PIQA while outperforming most non-frontier models on GSM8K and HumanEval). In the revision we will add representative scores and size comparisons directly to the abstract. revision: yes

  2. Referee: [Abstract] The training corpus is generated by larger LLMs; no decontamination statistics, n-gram overlap analysis, or ablation isolating synthetic versus web data on held-out problems (e.g., GSM8K) are supplied, leaving open the possibility that reported gains reflect distributional overlap rather than genuine reasoning improvement.

    Authors: We acknowledge this is a fair point. Although the synthetic textbook data is generated via prompted larger models with explicit instructions for diversity and step-by-step reasoning, the original submission did not report n-gram overlap or decontamination statistics against held-out sets. We will add these analyses (including 13-gram overlap with GSM8K and an ablation of synthetic-only versus mixed training) in a new subsection of the data section to rule out contamination as the source of gains. revision: yes
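A minimal sketch of the kind of 13-gram overlap check described in this response, assuming the synthetic corpus and the GSM8K test split are available as JSON-lines files; whitespace tokenization and the file names are simplifying assumptions.

```python
# Count GSM8K test questions that share at least one 13-gram with the
# synthetic training corpus. File names and whitespace tokenization are
# simplifying assumptions; a real decontamination pass would normalize text
# and likely hash the n-grams to bound memory.
import json

def ngrams(text: str, n: int = 13) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

train_grams: set = set()
with open("synthetic_textbooks.jsonl") as f:   # hypothetical corpus dump
    for line in f:
        train_grams |= ngrams(json.loads(line)["text"])

with open("gsm8k_test.jsonl") as f:            # hypothetical benchmark dump
    items = [json.loads(line) for line in f]

flagged = sum(bool(ngrams(item["question"]) & train_grams) for item in items)
print(f"{flagged}/{len(items)} test questions share a 13-gram with training data")
```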

Circularity Check

0 steps flagged

No circularity; performance claims rest on external benchmark evaluations

full rationale

The paper is a technical report describing the training of phi-1.5 on synthetic textbook data generated by larger LLMs and reporting its empirical results on standard natural language, math, and coding benchmarks. No derivation chain, equations, or first-principles predictions are presented that reduce to fitted parameters or self-referential inputs. The central claims are benchmark scores (e.g., comparable to 5x larger models on grade-school math and coding), which are measured against external test sets rather than constructed from the paper's own definitions or prior outputs. Reference to the prior 'Textbooks Are All You Need' work is methodological context only and does not bear the load of the performance results. This is a standard empirical report with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claim is an empirical performance statement; no explicit free parameters, axioms, or invented entities are introduced in the abstract beyond standard assumptions of transformer training and benchmark validity.

pith-pipeline@v0.9.0 · 5579 in / 1039 out tokens · 46880 ms · 2026-05-14T19:13:13.944635+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    We follow the “Textbooks Are All You Need” approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding.

  • Foundation.DimensionForcing dimension_forced · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  2. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  3. Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning

    cs.CL 2026-05 unverdicted novelty 6.0

    TokenUnlearn identifies critical tokens via masking and entropy signals then applies hard selection or soft weighting to unlearn only those tokens, yielding better forgetting and retained utility than sequence-level b...

  4. Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...

  5. CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

    cs.CV 2026-04 unverdicted novelty 6.0

    CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.

  6. Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

    cs.CV 2026-04 unverdicted novelty 6.0

    IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

  7. GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

    cs.DB 2026-04 unverdicted novelty 6.0

    GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.

  8. Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

    cs.CV 2026-04 unverdicted novelty 6.0

    Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...

  9. ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

  10. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  11. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  12. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  13. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  14. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  15. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  16. DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs

    cs.CR 2026-04 unverdicted novelty 4.0

    DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.

  17. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

  18. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

  19. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  20. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 20 Pith papers · 13 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    [AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 ,

  2. [2]

    Identify, align, and integrate: Matching knowledge graphs to commonsense reasoning tasks

    [BB21] Lisa Bauer and Mohit Bansal. Identify, align, and integrate: Matching knowledge graphs to commonsense reasoning tasks. arXiv preprint arXiv:2104.10193 ,

  3. [3]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    [BCE+23] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712,

  4. [4]

    On the dangers of stochastic parrots: Can language models be too big?

    [BGMMS21] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency , pages 610–623,

  5. [5]

    Piqa: Reasoning about physical commonsense in natural language

    [BHT+19] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Y Chai, Mirella Lapata, Angeliki Lazaridou, Ryan J Maynez, Piyush Narang, et al. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641,

  6. [6]

    Training Verifiers to Solve Math Word Problems

    [CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  7. [7]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    [CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

  8. [8]

    PaLM: Scaling Language Modeling with Pathways

    [CND+22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 ,

  9. [9]

    Evaluating Large Language Models Trained on Code

    [CTJ+21] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  10. [10]

    Tinystories: How small can language models be and still speak coherent english?

    [EL23] Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759 ,

  11. [11]

    First steps of an approach to the arc challenge based on descriptive grid models and the minimum description length principle

    [Fer21] Sébastien Ferré. First steps of an approach to the arc challenge based on descriptive grid models and the minimum description length principle. arXiv preprint arXiv:2112.00848,

  12. [12]

    Textbooks Are All You Need

    [GZA+23] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Gustavo de Rosa, Piero Kauffmann, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv prepr...

  13. [13]

    Measuring Massive Multitask Language Understanding

    [HBB+20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

  14. [14]

    ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

    [HGP+22] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509 ,

  15. [15]

    An empirical study of metrics to measure representational harms in pre-trained language models

    [HPA23] Saghar Hosseini, Hamid Palangi, and Ahmed Hassan Awadallah. An empirical study of metrics to measure representational harms in pre-trained language models. arXiv preprint arXiv:2301.09211,

  16. [16]

    The stack: 3 tb of permissively licensed source code

    [KLA+22] Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, et al. The stack: 3 tb of permissively licensed source code. arXiv preprint arXiv:2211.15533,

  17. [17]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    [MCKS18] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789,

  18. [18]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 [cs.CL]. [PMH+23] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116,

  19. [19]

    Answering questions by learning to rank

    [PRR19] George-Sebastian Pîrtoacă, Traian Rebedea, and Stefan Ruseti. Answering questions by learning to rank. arXiv preprint arXiv:1909.00596,

  20. [20]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    [RZLL16] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 ,

  21. [21]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    [SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641,

  22. [22]

    LLaMA: Open and Efficient Foundation Language Models

    [TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  23. [23]

    Taxonomy of risks posed by language models

    [WUR+22] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 214–229,

  24. [24]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    [ZCS+23] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685 ,