arXiv preprint arXiv:2510.26768

Accessed: · 2026 · arXiv 2510.26768

9 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 1 · use dataset: 1

representative citing papers

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Presents ComBench, a 100-problem Olympiad combinatorics benchmark with dual analysis and construction tracks, reporting top model scores of 65.4% Avg / 75.3% Best@4 and distinct capabilities between proof and construction.

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

ATLAS traces RLVR data to 20 atomic sources and finds that most datasets are variants; DAPO++, curated with SCA, improves RLVR performance, while Q predicts training effectiveness.

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

TEMA is the first framework for multi-modification composed image retrieval, using entity mapping to improve accuracy on both new complex datasets and existing benchmarks while balancing efficiency.

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

RA-RFT trains a retriever to rank contexts by expected reasoning benefit and uses the retrieved analogies inside reinforcement fine-tuning, yielding 7.1 and 2.8 point gains on AIME 2025 over GRPO for two Qwen3 models.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

Seed1.8 Model Card: Towards Generalized Real-World Agency

cs.AI · 2026-03-21 · unverdicted · novelty 5.0

Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

cs.CL · 2026-04-17 · unverdicted · novelty 4.0 · 2 refs

SemanticQA unifies prior multiword expression datasets into a benchmark that reveals substantial performance variation among language models on semantic reasoning tasks.

Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

cs.CL · 2026-03-18 · unverdicted · novelty 3.0

Zero-shot prompting reaches 59% accuracy at moderate temperatures while chain-of-thought prompting excels at temperature extremes on Olympiad-level math problems, with extended reasoning gains scaling to 14.3x at high temperature.

citing papers explorer


Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
cs.CL · 2026-05-09 · unverdicted · none · ref 5 · 2 links

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
cs.AI · 2026-06-09 · unverdicted · none · ref 2

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
cs.LG · 2026-05-26 · unverdicted · none · ref 3

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
cs.CV · 2026-04-23 · unverdicted · none · ref 10

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
cs.CL · 2026-06-11 · unverdicted · none · ref 53

Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG · 2026-05-17 · unverdicted · none · ref 37

Seed1.8 Model Card: Towards Generalized Real-World Agency
cs.AI · 2026-03-21 · unverdicted · none · ref 3

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
cs.CL · 2026-04-17 · unverdicted · none · ref 2 · 2 links

Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
cs.CL · 2026-03-18 · unverdicted · none · ref 21
