{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:3T7I6SRLYYQSMS2HDBIITTL7EY","short_pith_number":"pith:3T7I6SRL","schema_version":"1.0","canonical_sha256":"dcfe8f4a2bc621264b47185089cd7f26248b30c7e9609908cb419894e397dff4","source":{"kind":"arxiv","id":"2310.01377","version":2},"attestation_state":"computed","paper":{"title":"UltraFeedback: Boosting Language Models with Scaled AI Feedback","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Bingxiang He, Ganqu Cui, Guanming Yao, Guotong Xie, Lifan Yuan, Maosong Sun, Ning Ding, Ruobing Xie, Wei Zhu, Yankai Lin, Yuan Ni, Zhiyuan Liu","submitted_at":"2023-10-02T17:40:01Z","abstract_excerpt":"Learning from human feedback has become a pivot technique in aligning large language models (LLMs) with human preferences. However, acquiring vast and premium human feedback is bottlenecked by time, labor, and human capability, resulting in small sizes or limited topics of current datasets. This further hinders feedback learning as well as alignment research within the open-source community. To address this issue, we explore how to go beyond human feedback and collect high-quality \\textit{AI feedback} automatically for a scalable alternative. Specifically, we identify \\textbf{scale and diversi"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2310.01377","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-sa/4.0/","primary_cat":"cs.CL","submitted_at":"2023-10-02T17:40:01Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"b7b8be285286f3dd7d47544a7033add9fc57876b36c4cf43b92d8ac8f1cd2f66","abstract_canon_sha256":"1d36aa47da97202909f564bcf2fd99c5f68f7c70f1de301a52bfcd55c832cdff"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.587095Z","signature_b64":"049o1knKGsFqThP+wGkEo0Fq76L3edqexuZ5UYUOAEePmNJ0iEctiq3GzGa4fWqvQO+MvR5HFDw/X1ALYa6NDA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"dcfe8f4a2bc621264b47185089cd7f26248b30c7e9609908cb419894e397dff4","last_reissued_at":"2026-05-17T23:38:13.586464Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.586464Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"UltraFeedback: Boosting Language Models with Scaled AI Feedback","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Bingxiang He, Ganqu Cui, Guanming Yao, Guotong Xie, Lifan Yuan, Maosong Sun, Ning Ding, Ruobing Xie, Wei Zhu, Yankai Lin, Yuan Ni, Zhiyuan Liu","submitted_at":"2023-10-02T17:40:01Z","abstract_excerpt":"Learning from human feedback has become a pivot technique in aligning large language models (LLMs) with human preferences. However, acquiring vast and premium human feedback is bottlenecked by time, labor, and human capability, resulting in small sizes or limited topics of current datasets. This further hinders feedback learning as well as alignment research within the open-source community. To address this issue, we explore how to go beyond human feedback and collect high-quality \\textit{AI feedback} automatically for a scalable alternative. Specifically, we identify \\textbf{scale and diversi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the series of techniques applied to mitigate annotation biases in GPT-4 feedback produces sufficiently reliable and unbiased signals for effective model alignment.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UltraFeedback is a large-scale AI feedback dataset that enables effective alignment of open-source language models, yielding strong results on chat benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"651a10cc1350ba03067efd47bcb55e9e95127e7d55fc592ec68511628725edf1"},"source":{"id":"2310.01377","kind":"arxiv","version":2},"verdict":{"id":"252a8e97-63b0-4ad7-9667-1cd978ace386","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T16:26:16.522516Z","strongest_claim":"Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks.","one_line_summary":"UltraFeedback is a large-scale AI feedback dataset that enables effective alignment of open-source language models, yielding strong results on chat benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the series of techniques applied to mitigate annotation biases in GPT-4 feedback produces sufficiently reliable and unbiased signals for effective model alignment.","pith_extraction_headline":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models."},"references":{"count":14,"sample":[{"doi":"10.5281/zenodo.5371628","year":2021,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":1,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"10.18653/v1/","year":2023,"title":"doi: 10.18653/v1/ 2024.findings-acl.586","work_id":"8d675bdd-79ca-48d6-9163-fc17ce0e8ece","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv","year":2022,"title":"Self-critiquing models for assisting human evaluators","work_id":"3fcefdd1-22ab-4648-a683-cb1555e7a50e","ref_index":3,"cited_arxiv_id":"2206.05802","is_internal_anchor":true},{"doi":"","year":null,"title":"This may be particularly helpful if you have a busy schedule and may not have time to take them later in the day","work_id":"3761cc93-810c-498a-b8f7-6fbb54a50451","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Taking a vitamin D supplement after spending time outdoors can help boost your levels and ensure you’re getting enough","work_id":"dad4fd18-cbc0-46a0-866d-afcab590a1a9","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":14,"snapshot_sha256":"576c9171a7604250df5469674777be4ed6c66a9eead0820ce47ec4fd283f263d","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"d3e53c5bd066182d0d71c7229b7c10558aaf01828949e86336fb1134216b3905"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2310.01377","created_at":"2026-05-17T23:38:13.586563+00:00"},{"alias_kind":"arxiv_version","alias_value":"2310.01377v2","created_at":"2026-05-17T23:38:13.586563+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2310.01377","created_at":"2026-05-17T23:38:13.586563+00:00"},{"alias_kind":"pith_short_12","alias_value":"3T7I6SRLYYQS","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"3T7I6SRLYYQSMS2H","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"3T7I6SRL","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":37,"internal_anchor_count":37,"sample":[{"citing_arxiv_id":"2412.08812","citing_title":"Test-Time Alignment via Hypothesis Reweighting","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2501.05465","citing_title":"Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2502.06387","citing_title":"How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18141","citing_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18721","citing_title":"General Preference Reinforcement Learning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2508.04149","citing_title":"Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18721","citing_title":"General Preference Reinforcement Learning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12000","citing_title":"Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18721","citing_title":"General Preference Reinforcement Learning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18141","citing_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15300","citing_title":"Deep Pre-Alignment for VLMs","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2505.19134","citing_title":"Incentivizing High-Quality Human Annotations with Golden Questions","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2506.01937","citing_title":"RewardBench 2: Advancing Reward Model Evaluation","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2506.05967","citing_title":"Preference Learning for AI Alignment: a Causal Perspective","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2508.06412","citing_title":"Sample-efficient LLM Optimization with Reset Replay","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20265","citing_title":"Failure Modes of Maximum Entropy RLHF","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23102","citing_title":"Multiplayer Nash Preference Optimization","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2510.04595","citing_title":"SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2402.13116","citing_title":"A Survey on Knowledge Distillation of Large Language Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2410.18451","citing_title":"Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2310.16944","citing_title":"Zephyr: Direct Distillation of LM Alignment","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2406.08464","citing_title":"Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08813","citing_title":"Robust Policy Optimization to Prevent Catastrophic Forgetting","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2603.18113","citing_title":"VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04120","citing_title":"Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression","ref_index":4,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY","json":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY.json","graph_json":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/graph.json","events_json":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/events.json","paper":"https://pith.science/paper/3T7I6SRL"},"agent_actions":{"view_html":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY","download_json":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY.json","view_paper":"https://pith.science/paper/3T7I6SRL","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2310.01377&json=true","fetch_graph":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/graph.json","fetch_events":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/storage_attestation","attest_author":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/author_attestation","sign_citation":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/citation_signature","submit_replication":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/replication_record"}},"created_at":"2026-05-17T23:38:13.586563+00:00","updated_at":"2026-05-17T23:38:13.586563+00:00"}