{"total":12,"items":[{"citing_arxiv_id":"2605.09636","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation","primary_cat":"cs.AI","submitted_at":"2026-05-10T16:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and reasoning in large language models.arXiv preprint arXiv:2504.16074, 2025. [36] Zhongkai Hao, Jiachen Yao, Chang Su, Hang Su, Ziao Wang, Fanzhi Lu, Zeyu Xia, Yichi Zhang, Songming Liu, Lu Lu, et al. PINNacle: A comprehensive benchmark of physics-informed neural networks for solving PDEs.Advances in Neural Information Processing Systems, 37: 76721-76774, 2024. [37] Shuo Ren, Can Xie, Pu Jian, Zhenjiang Ren, Chunlin Leng, and Jiajun Zhang. Towards scientific intelligence: A survey of llm-based scientific agents, 2026. URL https://arxiv.org/abs/ 2503.24047. 12 [38] Jur 'gis Ruža and Rafael Gomez-Bombarelli. Reasoning-to-simulation: An agentic framework for discovery of electrolyte materials. InAI for Accelerated Materials Design - ICLR 2026,"},{"citing_arxiv_id":"2604.27351","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Heterogeneous Scientific Foundation Model Collaboration","primary_cat":"cs.AI","submitted_at":"2026-04-30T03:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"benchmark exploring variable, process, and solution dimensions. In Christos Christodoulopoulos, Tan- moy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 17290-17316. Association for Computational Linguistics, 2025. URL https://aclanthology.org/2025.findings-emnlp. 937/. [39] Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao,"},{"citing_arxiv_id":"2604.24443","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model","primary_cat":"cs.AI","submitted_at":"2026-04-27T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23580","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement","primary_cat":"cs.RO","submitted_at":"2026-04-26T07:37:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15411","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research","primary_cat":"cs.LG","submitted_at":"2026-04-16T16:22:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02934","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PolyReal: A Benchmark for Real-World Polymer Science Workflows","primary_cat":"cs.CV","submitted_at":"2026-04-03T10:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"largely educational settings. More recent domain-specific benchmarks include biomedical benchmarks built on med- ical images and clinical records [21, 22, 53], as well as chemistry and physics benchmarks such as Olympi- cArena [19], OlympiadBench [15], PhysUniBench [47], PhysX [43], HiPhO [54], JEEBench [4], GPQA [42], CM- PhysBench [49], PhyBench [40], ChemBench [35], and QCBench [52], which evaluate increasingly challenging scientific reasoning. However, existing chemistry and materials benchmarks, such as MacBench [1], SFE [58], MatSciBench [55], MSQA [9], and MatCha [25], mainly emphasize iso- lated subtasks rather than the connected stages of real- world polymer research. In contrast, PolyReal focuses on"},{"citing_arxiv_id":"2603.20633","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed1.8 Model Card: Towards Generalized Real-World Agency","primary_cat":"cs.AI","submitted_at":"2026-03-21T04:03:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"applications and demonstrate the economic utility of Seed1.8. We compare the results with GPT-5-high, Claude-Sonnet-4.5, Gemini-2.5-pro and Gemini-3-pro. Specifically, we evaluate Seed1.8 on AIME-25 [4], HMMT-25(Feb) [4], BeyondAIME [7], AMO-Bench [3], IMO- AnswerBench [42], AetherCode [75], LiveCodeBench(v6) [32], LiveCodeBench pro [93], GPQA-Diamond [57], PHYBench [55], BioBench, KOR-Bench [43], ARC-AGI-1 [53], Inverse IFEval [90], MARS-Bench [81], Multi- Challenge [17], Collie-Hard [84], EIFBench [99], MMLU [27], MMLU-pro [72], SuperGPQA [19], LPFQA [97], as well as six internal benchmarks designed for high-value real-world tasks. Reasoning.We categorize reasoning into coding, mathematics, STEM, and general reasoning."},{"citing_arxiv_id":"2602.07064","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization","primary_cat":"cs.CV","submitted_at":"2026-02-05T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01203","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse","primary_cat":"cs.CL","submitted_at":"2026-02-01T12:45:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08584","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ministral 3","primary_cat":"cs.CL","submitted_at":"2026-01-13T14:06:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.15745","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaDA2.0: Scaling Up Diffusion Language Models to 100B","primary_cat":"cs.LG","submitted_at":"2025-12-10T09:26:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26574","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark","primary_cat":"cs.AI","submitted_at":"2025-09-30T17:34:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. pages, 3828-3850, 2024. [40] N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025. [41] S. Qiu, S. Guo, Z.-Y . Song, Y . Sun, Z. Cai, J. Wei, T. Luo, Y . Yin, H. Zhang, Y . Hu, et al. PHYBench: Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025. [42] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset."}],"limit":50,"offset":0}