{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:CSY654CRWD64OIPFZ2LWXPSWC2","short_pith_number":"pith:CSY654CR","schema_version":"1.0","canonical_sha256":"14b1eef051b0fdc721e5ce976bbe56168c5f4b9b3db39f240432fa7349969614","source":{"kind":"arxiv","id":"2504.21318","version":1},"attestation_state":"computed","paper":{"title":"Phi-4-reasoning Technical Report","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Ahmed Awadallah, Arindam Mitra, Besmira Nushi, Caio C\\'esar Teodoro Mendes, Dimitris Papailiopoulos, Guoqing Zheng, Gustavo de Rosa, Harkirat Behl, Lingjiao Chen, Marah Abdin, Mojan Javaheripi, Neel Joshi, Olli Saarikivi, Piero Kauffmann, Safoora Yousefi, Sahaj Agarwal, Shital Shah, Suriya Gunasekar, Vaishnavi Shrivastava, Vibhav Vineet, Vidhisha Balachandran, Yash Lara, Yue Wu","submitted_at":"2025-04-30T05:05:09Z","abstract_excerpt":"We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of \"teachable\" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generatin"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2504.21318","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.AI","submitted_at":"2025-04-30T05:05:09Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"88aed280d7a9e33a84ea4a72eb25e0c5b88ae04d91c643a83976bc6cae82e7f8","abstract_canon_sha256":"34014de1e10e2d80b34c097e7349a5a05a3af62c4f55a5452f4ea11ba606fe0f"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:15.236662Z","signature_b64":"dsBKsP0orc7VCVS0MgkAFwFSBz3wZwW6svuN7Obdyc0bWV5v8+qjB9dOpcz6u92fOQzXB3+K1VwM02Y4ukpqDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"14b1eef051b0fdc721e5ce976bbe56168c5f4b9b3db39f240432fa7349969614","last_reissued_at":"2026-05-17T23:38:15.236139Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:15.236139Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Phi-4-reasoning Technical Report","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Ahmed Awadallah, Arindam Mitra, Besmira Nushi, Caio C\\'esar Teodoro Mendes, Dimitris Papailiopoulos, Guoqing Zheng, Gustavo de Rosa, Harkirat Behl, Lingjiao Chen, Marah Abdin, Mojan Javaheripi, Neel Joshi, Olli Saarikivi, Piero Kauffmann, Safoora Yousefi, Sahaj Agarwal, Shital Shah, Suriya Gunasekar, Vaishnavi Shrivastava, Vibhav Vineet, Vidhisha Balachandran, Yash Lara, Yue Wu","submitted_at":"2025-04-30T05:05:09Z","abstract_excerpt":"We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of \"teachable\" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generatin"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the performance improvements stem primarily from the curated 'teachable' prompts and o3-mini demonstrations rather than from undisclosed details of the base Phi-4 model, evaluation choices, or overlap with the teacher model's training data.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b54824a526e19f3ddc8207541fac113367383b8c02acd126a6a06a2fa7f71b27"},"source":{"id":"2504.21318","kind":"arxiv","version":1},"verdict":{"id":"ce28904c-25c4-49ee-be44-7af242a44fef","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T03:35:32.502533Z","strongest_claim":"Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model.","one_line_summary":"A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the performance improvements stem primarily from the curated 'teachable' prompts and o3-mini demonstrations rather than from undisclosed details of the base Phi-4 model, evaluation choices, or overlap with the teacher model's training data.","pith_extraction_headline":"A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems."},"references":{"count":64,"sample":[{"doi":"","year":2024,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":1,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"","year":2024,"title":"Phi-4 Technical Report","work_id":"b6274271-7af9-4ee8-993b-ba1ba4205ba8","ref_index":2,"cited_arxiv_id":"2412.08905","is_internal_anchor":true},{"doi":"","year":2024,"title":"KITAB: evaluating llms on constraint satisfaction for information retrieval","work_id":"aaa04fb3-c48c-4dd7-af3a-fdd9d02aab90","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"AIME. Aime 83-24. https://huggingface.co/datasets/lchen001/AIME1983_2024, 2024. Accessed: 2025- 03-17","work_id":"4b96ed08-bbb1-4ae7-9952-1a0850c4901e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"AIME. Aime 2025. https://huggingface.co/datasets/lchen001/AIME2025, 2025. Accessed: 2025-03-17","work_id":"7b05bc8a-1b03-4a13-9850-a3c06a16b3b4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":64,"snapshot_sha256":"2bef4ff0aeeb823ff1d5bd4a9ed57fec0220a0949efd5330a8c516d47d619e4f","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b16ce42b4f0db5f06ab4f3721d7092b074f737f2bd355a72623434b724d4fdfa"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.21318","created_at":"2026-05-17T23:38:15.236236+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.21318v1","created_at":"2026-05-17T23:38:15.236236+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.21318","created_at":"2026-05-17T23:38:15.236236+00:00"},{"alias_kind":"pith_short_12","alias_value":"CSY654CRWD64","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"CSY654CRWD64OIPF","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"CSY654CR","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2509.16343","citing_title":"Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2510.04265","citing_title":"Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2509.08827","citing_title":"A Survey of Reinforcement Learning for Large Reasoning Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14249","citing_title":"Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2601.18832","citing_title":"The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21684","citing_title":"Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2602.09782","citing_title":"Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2603.10960","citing_title":"Ranking Reasoning LLMs under Test-Time Scaling","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23281","citing_title":"MathArena: Evaluating LLMs on Uncontaminated Math Competitions","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03231","citing_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11553","citing_title":"TwiSTAR:Think Fast, Think Slow, Then Act,Generative Recommendation with Adaptive Reasoning","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09260","citing_title":"Chain-of-Thought Reasoning Enhances In-Context Learning for LLM-Based Mobile Traffic Prediction","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25235","citing_title":"VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02290","citing_title":"Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00674","citing_title":"Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00072","citing_title":"XekRung Technical Report","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17940","citing_title":"When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07035","citing_title":"Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08299","citing_title":"SeLaR: Selective Latent Reasoning in Large Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07864","citing_title":"ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?","ref_index":1,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2","json":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2.json","graph_json":"https://pith.science/api/pith-number/CSY654CRWD64OIPFZ2LWXPSWC2/graph.json","events_json":"https://pith.science/api/pith-number/CSY654CRWD64OIPFZ2LWXPSWC2/events.json","paper":"https://pith.science/paper/CSY654CR"},"agent_actions":{"view_html":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2","download_json":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2.json","view_paper":"https://pith.science/paper/CSY654CR","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.21318&json=true","fetch_graph":"https://pith.science/api/pith-number/CSY654CRWD64OIPFZ2LWXPSWC2/graph.json","fetch_events":"https://pith.science/api/pith-number/CSY654CRWD64OIPFZ2LWXPSWC2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2/action/storage_attestation","attest_author":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2/action/author_attestation","sign_citation":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2/action/citation_signature","submit_replication":"https://pith.science/pith/CSY654CRWD64OIPFZ2LWXPSWC2/action/replication_record"}},"created_at":"2026-05-17T23:38:15.236236+00:00","updated_at":"2026-05-17T23:38:15.236236+00:00"}