{"paper":{"title":"Phi-4-reasoning Technical Report","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Ahmed Awadallah, Arindam Mitra, Besmira Nushi, Caio César Teodoro Mendes, Dimitris Papailiopoulos, Guoqing Zheng, Gustavo de Rosa, Harkirat Behl, Lingjiao Chen, Marah Abdin, Mojan Javaheripi, Neel Joshi, Olli Saarikivi, Piero Kauffmann, Safoora Yousefi, Sahaj Agarwal, Shital Shah, Suriya Gunasekar, Vaishnavi Shrivastava, Vibhav Vineet, Vidhisha Balachandran, Yash Lara, Yue Wu","submitted_at":"2025-04-30T05:05:09Z","abstract_excerpt":"We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of \"teachable\" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. 
We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generatin"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the performance improvements stem primarily from the curated 'teachable' prompts and o3-mini demonstrations rather than from undisclosed details of the base Phi-4 model, evaluation choices, or overlap with the teacher model's training data.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b54824a526e19f3ddc8207541fac113367383b8c02acd126a6a06a2fa7f71b27"},"source":{"id":"2504.21318","kind":"arxiv","version":1},"verdict":{"id":"ce28904c-25c4-49ee-be44-7af242a44fef","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T03:35:32.502533Z","strongest_claim":"Across a wide range of reasoning tasks, both models outperform significantly larger 
open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model.","one_line_summary":"A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the performance improvements stem primarily from the curated 'teachable' prompts and o3-mini demonstrations rather than from undisclosed details of the base Phi-4 model, evaluation choices, or overlap with the teacher model's training data.","pith_extraction_headline":"A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems."},"references":{"count":64,"sample":[{"doi":"","year":2024,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":1,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"","year":2024,"title":"Phi-4 Technical Report","work_id":"b6274271-7af9-4ee8-993b-ba1ba4205ba8","ref_index":2,"cited_arxiv_id":"2412.08905","is_internal_anchor":true},{"doi":"","year":2024,"title":"KITAB: evaluating llms on constraint satisfaction for information retrieval","work_id":"aaa04fb3-c48c-4dd7-af3a-fdd9d02aab90","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"AIME. Aime 83-24. https://huggingface.co/datasets/lchen001/AIME1983_2024, 2024. Accessed: 2025-03-17","work_id":"4b96ed08-bbb1-4ae7-9952-1a0850c4901e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"AIME. Aime 2025. https://huggingface.co/datasets/lchen001/AIME2025, 2025. 
Accessed: 2025-03-17","work_id":"7b05bc8a-1b03-4a13-9850-a3c06a16b3b4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":64,"snapshot_sha256":"2bef4ff0aeeb823ff1d5bd4a9ed57fec0220a0949efd5330a8c516d47d619e4f","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b16ce42b4f0db5f06ab4f3721d7092b074f737f2bd355a72623434b724d4fdfa"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}