Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia · 2023 · arXiv 2305.12524

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

citation-role summary

dataset 2 background 1 other 1

citation-polarity summary

unclear 2 background 1 use dataset 1

representative citing papers

FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

FinVQA is a new multilingual benchmark for Indic financial VQA with three difficulty levels and four formats, paired with the FIND framework for faithful numerical reasoning via fine-tuning and constrained decoding.

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

cs.CL · 2023-09-11 · conditional · novelty 6.0

MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

cs.LG · 2026-06-09 · unverdicted · novelty 5.0

EEVEE introduces a router-based multi-dataset test-time prompt learning framework for LLM agents that uses router-prompt co-evolution to improve robustness on heterogeneous data streams.

Detecting Language Model Attacks with Perplexity

cs.CL · 2023-08-27 · unverdicted · novelty 5.0

Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

cs.AI · 2026-04-04

citing papers explorer

Showing 10 of 10 citing papers.

FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages cs.CL · 2026-05-13 · unverdicted · none · ref 1
FinVQA is a new multilingual benchmark for Indic financial VQA with three difficulty levels and four formats, paired with the FIND framework for faithful numerical reasoning via fine-tuning and constrained decoding.
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics cs.AI · 2026-05-09 · unverdicted · none · ref 3
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization cs.LG · 2026-05-29 · unverdicted · none · ref 2
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 83
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning cs.LG · 2026-04-22 · unverdicted · none · ref 2
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs cs.CL · 2026-04-14 · unverdicted · none · ref 8
Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning cs.CL · 2023-09-11 · conditional · none · ref 6
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents cs.LG · 2026-06-09 · unverdicted · none · ref 5
EEVEE introduces a router-based multi-dataset test-time prompt learning framework for LLM agents that uses router-prompt co-evolution to improve robustness on heterogeneous data streams.
Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 64
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning cs.AI · 2026-04-04 · unreviewed · ref 9

Theoremqa: A theorem-driven question answering dataset

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer