Toward generalizable evaluation in the llm era: A survey beyond benchmarks

Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, et al · 2025 · arXiv 2504.18838

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

representative citing papers

Security in LLM-as-a-Judge: A Comprehensive SoK

cs.CR · 2026-03-31 · accept · novelty 8.0

The first SoK on LLM-as-a-Judge security organizes attacks targeting judges, attacks using judges, defenses leveraging judges, and security-domain applications while flagging vulnerabilities.

Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.

Hint Tuning: Less Data Makes Better Reasoners

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models without accuracy loss.

Reasoning emerges from constrained inference manifolds in large language models

cs.LG · 2026-05-02 · unverdicted · novelty 5.0

Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.

citing papers explorer

Showing 4 of 4 citing papers.

Security in LLM-as-a-Judge: A Comprehensive SoK cs.CR · 2026-03-31 · accept · none · ref 7
The first SoK on LLM-as-a-Judge security organizes attacks targeting judges, attacks using judges, defenses leveraging judges, and security-domain applications while flagging vulnerabilities.
Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks cs.LG · 2026-05-02 · unverdicted · none · ref 41
EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.
Hint Tuning: Less Data Makes Better Reasoners cs.CL · 2026-05-09 · unverdicted · none · ref 42
Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models without accuracy loss.
Reasoning emerges from constrained inference manifolds in large language models cs.LG · 2026-05-02 · unverdicted · none · ref 10
Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.

Toward generalizable evaluation in the llm era: A survey beyond benchmarks

fields

years

verdicts

representative citing papers

citing papers explorer