Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
arXiv preprint arXiv:2510.26768
9 Pith papers cite this work.
years: 2026 (9)
verdicts: UNVERDICTED (9)
representative citing papers
citing papers explorer
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
-
ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
Presents ComBench, a 100-problem Olympiad combinatorics benchmark with dual analysis and construction tracks, reporting top model scores of 65.4% Avg / 75.3% Best@4 and distinct capabilities between proof and construction.
-
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
ATLAS traces RLVR data to 20 atomic sources and finds that most datasets are variants; DAPO++, curated with SCA, improves RLVR performance, while Q predicts training effectiveness.
-
TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
TEMA is the first framework for multi-modification composed image retrieval, using entity mapping to improve accuracy on both new complex datasets and existing benchmarks while balancing efficiency.
-
Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
RA-RFT trains a retriever to rank contexts by expected reasoning benefit and uses the retrieved analogies inside reinforcement fine-tuning, yielding 7.1 and 2.8 point gains on AIME 2025 over GRPO for two Qwen3 models.
-
Self-Supervised On-Policy Distillation for Reasoning Language Models
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
SemanticQA unifies prior multiword expression datasets into a benchmark that reveals substantial performance variation among language models on semantic reasoning tasks.
-
Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Zero-shot prompting reaches 59% accuracy at moderate temperatures while chain-of-thought prompting excels at temperature extremes on Olympiad-level math problems, with extended reasoning gains scaling to 14.3x at high temperature.