
arxiv: 2507.20208 · v2 · pith:C2EF77FXnew · submitted 2025-07-27 · 💻 cs.CL

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

classification 💻 cs.CL
keywords benchmarks, models, evaluation, factors, latent, low-rank, small, tasks

Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation by asking whether benchmark performance reflects many independent abilities or rather relies on a small number of shared dimensions. To answer this, we apply Factor Analysis (FA) to a massive performance matrix of LLMs versus benchmarks \((60\times 44)\), revealing an intrinsically low-rank structure of that matrix. That is, a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to be measuring overlapping abilities. We further show that these latent factors correspond to coherent, skill-like dimensions of LLM behavior. Leveraging this latent skill-space, we deliver three practical tools for LLM evaluation and downstream users: (i) identifying redundant tasks, (ii) profiling new models using a small subset of tasks, and (iii) selecting models aligned with desired skill profiles. Our method provides a solid alternative to the de-facto standard of a single aggregate score, and establishes an interpretable and practical framework for understanding and benchmarking LLM core capabilities.
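
The low-rank idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: it swaps in PCA via SVD for Factor Analysis, and it runs on a synthetic 60 × 44 matrix generated from three made-up latent skills. The names `skills`, `loadings`, `profile_from_subset`, the noise level, and the choice of three factors are all assumptions for illustration. The sketch checks how much variance the top components capture, then estimates a held-out model's full score profile from ten observed tasks, loosely mirroring tool (ii).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a 60 x 44 model-by-benchmark score matrix.
# All values are made up; n_skills = 3 is an assumption, not a paper result.
n_models, n_tasks, n_skills = 60, 44, 3
skills = rng.normal(size=(n_models, n_skills))    # hypothetical latent skill levels
loadings = rng.normal(size=(n_skills, n_tasks))   # hypothetical task-on-skill loadings
noise = 0.05 * rng.normal(size=(n_models, n_tasks))
scores = skills @ loadings + noise

# Hold out the last row to play the role of a "new" model.
train, new_model = scores[:-1], scores[-1]

# PCA via SVD on column-centered scores (a stand-in for Factor Analysis).
task_mean = train.mean(axis=0)
_, sing_vals, Vt = np.linalg.svd(train - task_mean, full_matrices=False)
explained = sing_vals**2 / np.sum(sing_vals**2)
top_k_share = explained[:n_skills].sum()
print(f"variance captured by top {n_skills} components: {top_k_share:.3f}")


def profile_from_subset(observed, task_idx, k=n_skills):
    """Estimate k latent coordinates from a few tasks, then predict all tasks."""
    basis = Vt[:k, task_idx].T                      # shape (len(task_idx), k)
    target = observed - task_mean[task_idx]
    coords, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return task_mean + coords @ Vt[:k]


# Observe only the first 10 benchmarks for the held-out model.
subset = np.arange(10)
predicted = profile_from_subset(new_model[subset], subset)
rmse = float(np.sqrt(np.mean((predicted - new_model) ** 2)))
print(f"RMSE over all {n_tasks} tasks from {len(subset)} observed: {rmse:.3f}")
```

On real benchmark data the factor count would have to be chosen from the data, and a proper Factor Analysis model (which separates shared variance from per-task noise) may behave differently from this PCA shortcut.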



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual Fingerprints for LLM Generation Comparison

    cs.AI 2026-05 unverdicted novelty 6.0

    Visual fingerprints represent distributions of linguistic choices extracted from repeated LLM samples to enable direct comparison of behaviors under different generation conditions.

  2. Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

    cs.CL 2026-04 unverdicted novelty 6.0

    A fixed-parameter multidimensional IRT calibration approach allows extending LLM benchmark suites over time, predicting full performance within 2-3 points and preserving rankings (Spearman ρ ≥ 0.9) using only 100 anch...