SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
14 Pith papers cite this work. Polarity classification is still indexing.
verdicts
Unverdicted: all 14 citing papers.
citing papers explorer
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
  Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with a language modeling loss and progressive schedules yields a competitive 23A2B model from Qwen3-Next-80A3B. (A hedged sketch of such a combined KD and LM loss appears after this list.)
- Rotation-Preserving Supervised Fine-Tuning
  RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions. (A hedged sketch of one possible rotation penalty appears after this list.)
- Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
  ICR creates a virtual shorter distribution from the shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks. (A hedged sketch of a shortest-correct-response regularizer appears after this list.)
- SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
  SciResearcher automates the creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-Literature.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
  Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Heterogeneous Scientific Foundation Model Collaboration
  Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
- Qwen3.5-Omni Technical Report
  Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
- MiMo-V2-Flash Technical Report
  MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers. (A hedged sketch of draft-and-verify decoding with an MTP-style drafter appears after this list.)
- Kimi K2: Open Agentic Intelligence
  Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks, including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- Qwen3 Technical Report
  (Pith review generated a malformed one-line summary.)
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
  Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models. (A hedged sketch of SAE-based feature steering appears after this list.)
- Supplement Generation Training for Enhancing Agentic Task Performance
  SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model. (A hedged sketch of a supplement-then-answer pipeline appears after this list.)
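code sketches

SlimQwen's summary mentions combining KD with a language modeling loss under progressive schedules. The exact formulation is not given there; the following is a minimal PyTorch sketch of one common combination, assuming a temperature-scaled KL term, hard-label cross-entropy, and a linearly annealed KD weight (the schedule shape and all hyperparameters are illustrative, not SlimQwen's).

```python
import torch.nn.functional as F

def distillation_step_loss(student_logits, teacher_logits, labels, step, total_steps,
                           temperature=2.0, kd_weight_start=0.9, kd_weight_end=0.1):
    """KD (KL to the teacher) combined with LM cross-entropy, with the KD weight
    annealed linearly over training as one guess at a 'progressive schedule'."""
    # Progressive schedule: shift weight from the KD term toward the LM term over training.
    frac = min(step / max(total_steps, 1), 1.0)
    kd_weight = kd_weight_start + frac * (kd_weight_end - kd_weight_start)

    vocab = student_logits.size(-1)
    # Hard-label next-token cross-entropy (the usual language modeling loss).
    # Padding / label masking beyond ignore_index is omitted for brevity.
    lm_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                              labels.reshape(-1), ignore_index=-100)

    # Temperature-scaled KL between student and teacher token distributions.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    kd_loss = F.kl_div(s_logp, t_prob, reduction="batchmean") * (temperature ** 2)

    return kd_weight * kd_loss + (1.0 - kd_weight) * lm_loss
```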
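RPSFT is summarized as penalizing rotations in pretrained singular subspaces. The regularizer below is only one plausible proxy, assuming the penalty targets weight updates that map the pretrained top-k right singular directions outside the pretrained top-k left singular subspace; the paper's actual loss may differ.

```python
import torch

def pretrained_top_k_subspaces(w0: torch.Tensor, k: int):
    """Compute once before fine-tuning: top-k singular subspaces of the pretrained weight."""
    u, s, vh = torch.linalg.svd(w0, full_matrices=False)
    return u[:, :k], vh[:k, :].T                 # U0_k: (out, k), V0_k: (in, k)

def subspace_rotation_penalty(w: torch.Tensor, u0_k: torch.Tensor, v0_k: torch.Tensor):
    """Energy the current weight sends from the pretrained right subspace outside the
    pretrained left subspace; zero if those subspaces are preserved (a hand-rolled proxy)."""
    wv = w @ v0_k                                # restrict inputs to the pretrained right subspace
    off = wv - u0_k @ (u0_k.T @ wv)              # component leaving the pretrained left subspace
    return off.pow(2).sum()

# During SFT one would add something like `lam * subspace_rotation_penalty(W, U0_k, V0_k)`
# per selected weight matrix to the task loss; `lam` and the layer set are hyperparameters.
```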
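ICR's summary describes building a virtual shorter distribution from the shortest correct on-policy responses. One way such a term could attach to an RL post-training objective is sketched below, assuming per-rollout correctness flags and differentiable summed log-probabilities under the current policy; the paper's actual construction is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional
import torch

@dataclass
class Rollout:
    num_tokens: int
    is_correct: bool
    sum_logprob: torch.Tensor   # summed token log-probs under the current policy (grad-tracked)

def shortest_correct_regularizer(groups: List[List[Rollout]]) -> Optional[torch.Tensor]:
    """Length-normalized NLL of the shortest correct rollout per prompt: extra likelihood
    pressure toward responses that are both concise and correct."""
    terms = []
    for rollouts in groups:                              # one group of on-policy samples per prompt
        correct = [r for r in rollouts if r.is_correct]
        if correct:
            best = min(correct, key=lambda r: r.num_tokens)
            terms.append(-best.sum_logprob / max(best.num_tokens, 1))
    return torch.stack(terms).mean() if terms else None

# Combined objective (illustrative): total_loss = rl_loss + beta * reg, when reg is not None.
```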
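MiMo-V2-Flash is said to repurpose MTP layers for 2.6x faster decoding. The report's exact procedure is not detailed in the summary above, so the loop below is a generic greedy draft-and-verify sketch: a cheap drafter (standing in for an MTP-style head) proposes a block of tokens, and the full model scores the whole block in a single pass. Both models are abstracted as callables so the snippet stays framework-agnostic.

```python
from typing import Callable, List

def draft_and_verify_decode(
    verify_greedy: Callable[[List[int]], List[int]],   # full model: greedy next token at every position
    draft_block: Callable[[List[int]], List[int]],     # cheap drafter (e.g., MTP-style head): next k tokens
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Greedy speculative decoding: accept the drafted prefix the full model agrees with,
    then take one token from the full model and repeat."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_block(tokens)
        preds = verify_greedy(tokens + draft)   # one full-model pass over context + draft
        base = len(tokens) - 1                  # preds[base] is the model's choice right after `tokens`
        n_accept = 0
        for j, t in enumerate(draft):
            if preds[base + j] == t:            # draft token matches the full model's greedy choice
                n_accept += 1
            else:
                break
        tokens += draft[:n_accept]
        tokens.append(preds[base + n_accept])   # correction (or bonus) token from the full model
    return tokens[: len(prompt) + max_new_tokens]
```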
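Qwen-Scope's summary positions sparse autoencoders as interfaces for steering. The generic pattern is to take a chosen SAE feature's decoder direction and add it to the residual stream via a forward hook, as sketched below with a toy SAE; the released autoencoders' weight format, the hook point, and the feature index are all assumptions.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Toy sparse autoencoder over a d_model-dim residual stream (stand-in for a released SAE)."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x):                         # x: (..., d_model)
        feats = torch.relu(self.enc(x))           # sparse feature activations
        return feats, self.dec(feats)             # features and reconstruction

def make_steering_hook(sae: TinySAE, feature_idx: int, strength: float):
    """Forward hook that adds `strength` units of one SAE feature's decoder direction
    to a layer's hidden states, leaving the rest of the computation untouched."""
    direction = sae.dec.weight[:, feature_idx]    # (d_model,) decoder column for the feature
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction.to(hidden)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# usage (hypothetical layer path): handle = model.layers[12].register_forward_hook(
#     make_steering_hook(sae, feature_idx=4096, strength=8.0)); ...; handle.remove()
```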
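SGT is summarized as a lightweight model that generates task-specific supplemental text for a larger frozen LLM. Inference for such a setup could look like the sketch below, written against generic `generate` callables so it does not assume any particular serving API; the prompt templates are illustrative.

```python
from typing import Callable

def answer_with_supplement(
    small_generate: Callable[[str], str],   # trained lightweight supplement generator
    large_generate: Callable[[str], str],   # frozen large model, never updated
    task: str,
) -> str:
    """Two-stage inference: the small model writes supplemental context for the task,
    which is simply prepended to the frozen large model's prompt."""
    supplement = small_generate(
        "Write brief supplemental notes (facts, constraints, tool hints) for this task:\n" + task
    )
    prompt = (
        "Use the supplemental notes if helpful, then solve the task.\n"
        f"Supplemental notes:\n{supplement}\n\nTask:\n{task}"
    )
    return large_generate(prompt)
```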