{"total":22,"items":[{"citing_arxiv_id":"2605.15482","ref_index":14,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-14T23:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FINESSE-Bench is a new hierarchical benchmark suite combining certification-style exams, trading tasks, and a Russian olympiad set to evaluate LLMs on financial competencies at multiple difficulty levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11663","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability","primary_cat":"cs.CL","submitted_at":"2026-05-12T07:22:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08738","ref_index":41,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:50:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27393","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction","primary_cat":"cs.CL","submitted_at":"2026-04-30T04:05:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"8 64.4 61.9 72.0 MMSI-Bench 12.1 - 11.3 14.2 16.6 Video Video-MME (w/o subs)75.666.0 71.4 70.5 70.4 LVBench62.2- 58.0 50.2 50.9 MLVU (M-Avg) 77.8 70.278.175.2 76.5 LongVideoBench (val) - 62.1 66.466.9 66.0 MotionBench -62.359.5 61.7 61.4 spoken question answering on V oiceBench [63], Speech TriviaQA [64], Speech Web Questions [65], and Speech CMMU [66]. For speech generation, we evaluate speech quality, intelligibility, speaker similarity, long-form generation, and emotion/style control using SeedTTS Test [67], LongTTS [68], Expresso [69], and ESD [70]. Text Capability.We compare MiniCPM-o 4.5 with its language backbone, Qwen3-Instruct-8B [10], to assess whether omni-modal training preserves core text abilities."},{"citing_arxiv_id":"2604.24690","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination","primary_cat":"cs.CL","submitted_at":"2026-04-27T16:50:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProHist-Bench shows that even state-of-the-art LLMs struggle with complex historical research questions requiring evidentiary reasoning, based on 400 questions and 10,891 rubrics from the Keju system.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18946","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reasoning Structure Matters for Safety Alignment of Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-04-21T00:50:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.02780","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiMo-V2-Flash Technical Report","primary_cat":"cs.CL","submitted_at":"2026-01-06T07:31:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.15745","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLaDA2.0: Scaling Up Diffusion Language Models to 100B","primary_cat":"cs.LG","submitted_at":"2025-12-10T09:26:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.18265","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","submitted_at":"2025-08-25T17:58:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"SpaCE-10 45.5 55.0 43.4* 39.2* 37.9* 51.6* 42.6* 43.8* OmniSpatial 48.1 51.9 47.7† 37.3† 47.9† 51.0† 47.0† 59.6* Overall 60.6 66.2 - - - - - - Table 2: The overall comparison of InternVL3.5 series and existing open-source and closed-source MLLMs. *: reproduced through VLMEvalkit [31]. †: reported by GLM-4.5V [46]. ‡: reported by OpenCompass [20]. 9 MMLU-Pro [61], GAOKAO [177], IFEval [ 185]; (4) Agentic Tasks: SGP-Bench [ 102], ScreenSpot [ 16], ScreenSpot-v2 [150], OSWorld-G [156], VSI-Bench [161], ERQA [121], SpaCE-10 [38]. We report results of our flagship models (InternVL3.5-30B-A3B and InternVL3.5-241B-A28B) and frontier open-source MLLMs (GLM-4.1V [46], Kimi-VL-A3B-2506 [125], GLM-4.5V [46], Qwen2.5-VL-72B [5] and"},{"citing_arxiv_id":"2507.20534","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kimi K2: Open Agentic Intelligence","primary_cat":"cs.LG","submitted_at":"2025-07-28T05:35:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Multi-SWE-bench [87], SWE-Lancer [51], PaperBench [66], and Aider-Polyglot [17]. For tool use tasks, we evaluate performance on τ2-Bench [3] and AceBench [7], which emphasize multi-turn tool-calling capabilities. In reasoning, we include a wide range of mathematical, science and logical tasks: AIME 2024/2025, MATH-500, HMMT 2025, CNMO 2024, PolyMath-en, ZebraLogic [44], AutoLogi [92], GPQA-Diamond [62], SuperGPQA [14], and Humanity's Last Exam (Text-Only) [57]. We benchmark the long-context capabilities on: MRCR5 for long-context retrieval, and DROP [15], FRAMES [38] and LongBench v2 [2] for long-context reasoning. For factuality, we evaluate FACTS Grounding [31], the Vectara Hallucination Leaderboard [74], and FaithJudge [69]."},{"citing_arxiv_id":"2506.12119","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource","primary_cat":"cs.CL","submitted_at":"2025-06-13T17:59:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.16155","ref_index":104,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRIMETIME : Limits of LLMs in Temporal Primitives","primary_cat":"cs.NE","submitted_at":"2025-04-22T17:52:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"work will explore its integration into various downstream applications. 3.13 Evaluation on Language Capability Table 11 presents the performance evaluation of language capabilities across a diverse array of benchmarks. These benchmarks cover comprehensive assessments in general knowledge, linguistic understanding, reasoning, mathematics, and coding tasks, such as MMLU [ 46], CMMLU [ 63], C-Eval [48], GAOKAO-Bench [149], TriviaQA [52], NaturalQuestions [ 58, 110], RACE [ 59], WinoGrande [ 103], HellaSwag [ 142], BigBench Hard [112], GSM8K-Test [25], MATH [47], TheoremQA [17], HumanEval [14], MBPP [4], and MBPP-CN [4]. In particular, the experiments conducted compare the performance of Qwen2.5 chat models against corre- sponding InternVL3 variants."},{"citing_arxiv_id":"2502.11089","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention","primary_cat":"cs.CL","submitted_at":"2025-02-16T11:53:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.09992","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Diffusion Models","primary_cat":"cs.CL","submitted_at":"2025-02-14T08:23:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. [123] Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023. [124] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models.Advances in Neural Information Processing Systems, 36, 2024. 17 [125] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles"},{"citing_arxiv_id":"2412.18925","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","primary_cat":"cs.CL","submitted_at":"2024-12-25T15:12:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05271","ref_index":127,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"examination, language and knowledge, reasoning, mathematics, and coding. 25 6.1 Benchmarks Comprehensive Examination. We conduct a thorough evaluation of LLMs and MLLMs using various exam- related datasets: (1)MMLU[ 85] includes 57 subtasks covering diverse topics such as humanities, social sciences, and STEM, evaluated with a 5-shot approach. (2)CMMLU[ 127], focused on a Chinese context, features 67 subtasks spanning general and Chinese-specific domains, also tested in a 5-shot setting. (3)C-Eval[ 96] contains 52 subtasks across four difficulty levels, evaluated in a 5-shot setting. (4)GAOKAO-Bench[ 304], derived from Chinese college entrance exams, offers comprehensive coverage of both subjective and objective question types,"},{"citing_arxiv_id":"2405.04434","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Specifically, for each question 𝑞, GRPO samples a group of outputs {𝑜1, 𝑜2, · · ·, 𝑜𝐺} from the old policy 𝜋𝜃𝑜𝑙𝑑 and then optimizes the policy model 𝜋𝜃 by maximizing the following objective: J𝐺𝑅𝑃𝑂 (𝜃) = E[𝑞 ∼ 𝑃(𝑄), {𝑜𝑖}𝐺 𝑖=1 ∼ 𝜋𝜃𝑜𝑙𝑑 (𝑂|𝑞)] 1 𝐺 𝐺∑︁ 𝑖=1 \u0012 min \u0012 𝜋𝜃(𝑜𝑖|𝑞) 𝜋𝜃𝑜𝑙𝑑 (𝑜𝑖|𝑞) 𝐴𝑖, clip \u0012 𝜋𝜃(𝑜𝑖|𝑞) 𝜋𝜃𝑜𝑙𝑑 (𝑜𝑖|𝑞) , 1 − 𝜀, 1 + 𝜀 \u0013 𝐴𝑖 \u0013 − 𝛽D𝐾 𝐿 𝜋𝜃|| 𝜋𝑟𝑒 𝑓 \u0001\u0013 , (32) D𝐾 𝐿 𝜋𝜃|| 𝜋𝑟𝑒 𝑓 \u0001 = 𝜋𝑟𝑒 𝑓 (𝑜𝑖|𝑞) 𝜋𝜃(𝑜𝑖|𝑞) − log 𝜋𝑟𝑒 𝑓 (𝑜𝑖|𝑞) 𝜋𝜃(𝑜𝑖|𝑞) − 1, (33) where 𝜀 and 𝛽 are hyper-parameters; and 𝐴𝑖 is the advantage, computed using a group of rewards {𝑟1, 𝑟2, . . . , 𝑟𝐺} corresponding to the outputs within each group: 𝐴𝑖 = 𝑟𝑖 − m𝑒𝑎𝑛({𝑟1, 𝑟2, · · ·, 𝑟𝐺}) s𝑡𝑑 ({𝑟1, 𝑟2, · · ·, 𝑟𝐺}) . (34) Training Strategy. In our preliminary experiments, we find that the RL training on reasoning"},{"citing_arxiv_id":"2403.04652","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Yi: Open Foundation Models by 01.AI","primary_cat":"cs.CL","submitted_at":"2024-03-07T16:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"strongly influenced by the assessment criteria and the design of the prompt. Our internal evaluation results may be unfair to other models, making it difficult to accurately represent the true capability level of our model. Therefore, here we only present external evaluation results to demonstrate the current conversational abilities of our chat model. We consider: (1). AlapcaEval 1 [44], which is designed to assess the English conversation capabilities of models by comparing the responses of a specified model to reference replies from Davinci003 [ 21] in order to calculate a win-rate; (2). LMSys2 [93] Chatbot Arena, which showcases the responses of different models through a dialogue platform, then asks users to make selections based on their preferences, then computes the Elo score;"},{"citing_arxiv_id":"2401.06066","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-11T17:31:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02954","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","primary_cat":"cs.CL","submitted_at":"2024-01-05T18:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.10305","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Baichuan 2: Open Large-scale Language Models","primary_cat":"cs.CL","submitted_at":"2023-09-19T04:13:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}