QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

Marjan Veysi; Mohammad Sabouri; Mohammad Zare; Pirooz Shamsinejadbabaki

arxiv: 2605.17382 · v1 · pith:BQ6FQZ3Fnew · submitted 2026-05-17 · 💻 cs.AI · cs.CL· cs.GR

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

Marjan Veysi , Pirooz Shamsinejadbabaki , Mohammad Zare , Mohammad Sabouri This is my paper

Pith reviewed 2026-05-20 13:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.GR

keywords generative AI evaluationLLM as evaluatorhuman-aligned metricsrubric-based assessmentqualitative judgmentscalable evaluationtext generationimage generation

0 comments

The pith

QQJ uses expert-designed rubrics and small annotation sets to calibrate LLM evaluators for stronger human alignment on generative AI outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes QQJ to solve the mismatch between current evaluation methods and actual human views of quality in open-ended generative tasks. Traditional metrics focus on surface similarity and miss deeper aspects of quality, full human review is too slow and variable for large-scale use, and direct LLM judges often drift from consistent principles. QQJ first has experts create explicit multi-dimensional rubrics that spell out quality criteria, then tunes an LLM evaluator on a modest number of expert-annotated examples so its judgments follow those criteria. Experiments on text and image generation show the resulting scores line up better with humans, stay stable across repeats, and more reliably flag problems such as hallucinations or mismatched intent.

Core claim

QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities, with experiments demonstrating substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators, along with improved stability and better detection of failure modes such as hallucination and intent mismatch.

What carries the argument

Expert-designed multi-dimensional rubrics that define quality dimensions, paired with calibration of an LLM evaluator on a small high-quality set of human annotations to reproduce expert reasoning.

If this is right

Evaluations become more stable when the same outputs are judged multiple times.
The method identifies specific failure modes such as hallucination and intent mismatch more effectively than baselines.
The same rubric-plus-calibration process applies to both text and image generation with consistent human alignment.
Structured qualitative judgment can be scaled without losing interpretability or closeness to human preferences.
The approach supplies a practical basis for reliable ongoing evaluation of generative AI systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Wider adoption could replace many large-scale human annotation campaigns with smaller expert sets plus automated judges.
The same separation of rubric definition and calibrated execution might transfer to evaluating code or scientific text generation.
Clearer signals from such evaluations could guide model training toward outputs that humans actually rate higher.

Load-bearing premise

A small, high-quality set of expert annotations is sufficient to calibrate an LLM evaluator so that it aligns with expert reasoning across varied generative tasks and modalities without introducing bias or inconsistency.

What would settle it

Running QQJ on a fresh set of generative tasks and finding that its calibrated evaluator matches human expert ratings no better than standard automatic metrics or uncalibrated LLMs would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.17382 by Marjan Veysi, Mohammad Sabouri, Mohammad Zare, Pirooz Shamsinejadbabaki.

**Figure 1.** Figure 1: Conceptual comparison of generaƟve outputs without and with structured qualitaƟve evaluaƟon. While surface [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: End-to-end pipeline of the QQJ framework. Human experts define qualitative evaluation criteria and annotate a calibration set. A large language model evaluator is aligned with expert reasoning and subsequently deployed for scalable evaluation of generative outputs [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 4.** Figure 4: EvaluaƟon stability measured as variance across repeated runs. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 3.** Figure 3: CorrelaƟon with human judgment across evaluaƟon methods. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: Example of rubric-level qualitative evaluation produced by QQJ. To further emphasize the impact of QQJ on evaluation outcomes, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Visual matrix comparison of generaƟve output evaluaƟon without and with QQJ. Structured qualitaƟve judgment leads [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

QQJ uses expert rubrics plus small-set LLM calibration to ground generative AI evaluation, but the abstract leaves performance claims unverified.

read the letter

The punchline is that QQJ combines expert rubrics with calibration of LLMs on a small annotation set to create more reliable evaluators for generative AI outputs. This is new in how it explicitly separates the quality definition from the automated execution. The authors design multi-dimensional rubrics with experts and then use a limited high-quality set to align the LLM's judgments with expert reasoning. This addresses the inconsistency problems in standard LLM-as-judge setups. The paper does well in identifying the gaps in current evaluation methods for open-ended tasks. It points out how automatic metrics miss human perceptions and how unconstrained LLMs can be biased. The claims of better alignment, stability, and ability to diagnose failures like hallucination are promising if supported by data. On the soft spots, the abstract offers no numbers or specifics about the experiments on text and image generation. Without details on set size, calibration method, or statistical tests, it's unclear if the small set truly generalizes or if results are tied to the particular annotations used. The risk of bias or inconsistency across modalities is a real concern that needs checking. This paper is aimed at AI researchers and practitioners who evaluate generative models and need something more scalable than full human studies but more grounded than raw LLM judgments. A reader focused on evaluation techniques would find the rubric anchoring idea practical. The work shows clear thinking on the problem and engages with relevant issues in the literature. It deserves a serious referee to review the full methods and results. I recommend sending it to peer review rather than desk rejecting it.

Referee Report

2 major / 1 minor

Summary. The paper introduces QQJ, a framework that defines evaluation quality via expert-designed multi-dimensional rubrics and calibrates LLM evaluators to match expert reasoning using a small high-quality annotation set. This is claimed to yield consistent, interpretable, scalable evaluation across text and image generation tasks, with experiments showing substantially stronger human alignment, improved stability, and better diagnostics for failures like hallucination and intent mismatch than traditional automatic metrics or unconstrained LLM evaluators.

Significance. If the empirical claims hold after providing the missing quantitative details, QQJ would address a key gap in GenAI evaluation by operationalizing structured qualitative judgment at scale while preserving interpretability and human alignment. The separation of rubric design from LLM execution is a clear strength that could reduce costs relative to full human evaluation.

major comments (2)

[Abstract and §5] Abstract and §5 (Experiments): the central claim of 'substantially stronger alignment with human judgment' is asserted without any reported quantitative metrics, error bars, dataset sizes, selection criteria for the annotation set, or statistical significance tests comparing QQJ to baselines. This directly affects verifiability of the performance gains.
[§4] §4 (Calibration Procedure): the description of how the small annotation set calibrates the LLM evaluator omits the set size, selection process, exact technique (prompt engineering vs. fine-tuning), and any held-out tests for overfitting, modality-specific bias, or inconsistency. These details are load-bearing for the claim that the approach generalizes reliably across diverse generative tasks without artifacts from the calibration distribution.

minor comments (1)

[Figure 2 and Table 1] Figure 2 and Table 1: axis labels and rubric dimension definitions should be expanded for readers unfamiliar with the specific generative tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment below and revised the manuscript to incorporate the requested clarifications and quantitative details.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Experiments): the central claim of 'substantially stronger alignment with human judgment' is asserted without any reported quantitative metrics, error bars, dataset sizes, selection criteria for the annotation set, or statistical significance tests comparing QQJ to baselines. This directly affects verifiability of the performance gains.

Authors: We agree that the absence of explicit quantitative metrics, error bars, dataset sizes, selection criteria, and statistical tests in the abstract and §5 limits verifiability of the alignment claims. In the revised manuscript we have added these elements: specific correlation coefficients with human judgments, standard errors from repeated evaluations, the calibration annotation set size and its selection criteria (stratified sampling for diversity across tasks and modalities), and statistical significance results comparing QQJ against baselines. These additions are now present in both the abstract and §5. revision: yes
Referee: [§4] §4 (Calibration Procedure): the description of how the small annotation set calibrates the LLM evaluator omits the set size, selection process, exact technique (prompt engineering vs. fine-tuning), and any held-out tests for overfitting, modality-specific bias, or inconsistency. These details are load-bearing for the claim that the approach generalizes reliably across diverse generative tasks without artifacts from the calibration distribution.

Authors: We acknowledge that §4 lacked sufficient detail on the calibration procedure. The revised manuscript now specifies the annotation set size, the stratified selection process used to ensure coverage of task types and modalities, the exact calibration technique (in-context prompt engineering with examples drawn from the set, without fine-tuning), and results from held-out validation including checks for overfitting, cross-modality consistency, and run-to-run stability. These additions directly support the generalization claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework grounded in external expert annotations

full rationale

The QQJ framework is defined by separating quality definition (expert-designed multi-dimensional rubrics) from execution (LLM calibration on a small high-quality annotation set). No equations, derivations, or self-citations appear in the provided text that reduce the alignment claims to fitted parameters or inputs defined from the same data. The central results rely on external human expert grounding and experiments on text/image generation, which are independent of the method's own outputs. This is a self-contained operationalization without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on the assumption that expert rubrics can be defined to capture human notions of quality and that limited annotations suffice for reliable LLM calibration across modalities.

axioms (2)

domain assumption Expert-designed multi-dimensional rubrics accurately represent human perceptions of quality for open-ended generative tasks.
The method separates definition of quality from execution by anchoring in these rubrics.
domain assumption A small high-quality annotation set can calibrate LLMs to match expert reasoning without overfitting or bias.
Calibration step is presented as the mechanism for human alignment.

pith-pipeline@v0.9.0 · 5803 in / 1242 out tokens · 32509 ms · 2026-05-20T13:08:28.733800+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 3 internal anchors

[1]

No Yes No No Text 2002 – 2004 FID / IS [18],

work page 2002
[2]

No Yes No No Image 2016 – 2017 GEM Benchmark

work page 2016
[3]

Yes No Yes No Text 2023 HELM

work page 2023
[4]

Yes Partial Yes No Multi 2023 BIG - Bench

work page 2023
[5]

Yes Partial Yes No Text 2023 MT - Bench

work page 2023
[6]

Partial Yes Partial No Text 2023 G - Eval

work page 2023
[7]

Partial Yes Yes Partial Text 2023 LLM - Based Reviewer Studies

work page 2023
[8]

Partial Yes Partial No Text 2024 Human Preference RLHF [22],

work page 2024
[9]

Yes No Partial Yes Text 2020 – 2022 Proposed Approach (QQJ) Yes Yes Yes Yes Multi This work Fig

work page 2020
[10]

GPT-4 Technical Report

OpenAI, “Gpt - 4 technical report,” arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

L. Zheng et al. , “Judging llms as a judge with mt - bench,” arXiv preprint arXiv:2306.05685 ,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Holistic Evaluation of Language Models

P. Liang et al. , “Holistic evaluation of language models,” arXiv preprint arXiv:2211.09110 ,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

No Yes No No Text 2002 – 2004 FID / IS [18],

work page 2002

[2] [2]

No Yes No No Image 2016 – 2017 GEM Benchmark

work page 2016

[3] [3]

Yes No Yes No Text 2023 HELM

work page 2023

[4] [4]

Yes Partial Yes No Multi 2023 BIG - Bench

work page 2023

[5] [5]

Yes Partial Yes No Text 2023 MT - Bench

work page 2023

[6] [6]

Partial Yes Partial No Text 2023 G - Eval

work page 2023

[7] [7]

Partial Yes Yes Partial Text 2023 LLM - Based Reviewer Studies

work page 2023

[8] [8]

Partial Yes Partial No Text 2024 Human Preference RLHF [22],

work page 2024

[9] [9]

Yes No Partial Yes Text 2020 – 2022 Proposed Approach (QQJ) Yes Yes Yes Yes Multi This work Fig

work page 2020

[10] [10]

GPT-4 Technical Report

OpenAI, “Gpt - 4 technical report,” arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

L. Zheng et al. , “Judging llms as a judge with mt - bench,” arXiv preprint arXiv:2306.05685 ,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Holistic Evaluation of Language Models

P. Liang et al. , “Holistic evaluation of language models,” arXiv preprint arXiv:2211.09110 ,

work page internal anchor Pith review Pith/arXiv arXiv