Crystal: Characterizing Relative Impact of Scholarly Publications

Benjamin Van Durme; Daniel Khashabi; Hannah Collison

arxiv: 2603.26791 · v3 · pith:TANZ5ZWVnew · submitted 2026-03-25 · 💻 cs.DL · cs.AI · cs.CL · cs.CY


Pith reviewed 2026-05-15 00:56 UTC · model grok-4.3

classification 💻 cs.DL · cs.AI · cs.CL · cs.CY

keywords citation impact · LLM ranking · majority voting · relative impact · scholarly publications · citation analysis · joint ranking


The pith

Crystal improves citation impact classification by jointly ranking all references with LLMs and majority voting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Crystal, a method that prompts large language models to rank every cited paper together inside each citing paper rather than judging citations one at a time. It runs the ranking three times, each in a randomized order, and keeps the label that wins the majority vote, which reduces the models' tendency to favor early positions in the list. This joint view lets the model use the full citation context of the citing paper to decide relative impact. Tested on human-annotated citations, the method raises accuracy by 9.5 points and F1 by 8.3 points over the prior state-of-the-art classifier, which scores each citation in isolation. The same procedure also uses fewer model calls and works with an open-weight model, lowering the cost of large-scale analysis.

Core claim

Crystal shows that prompting an LLM to produce a single ranked list of all cited papers within one citing paper, repeated three times with shuffled order and aggregated by majority vote, yields impact labels that align more closely with human judgments than classifiers that score each citation independently.

What carries the argument

Joint LLM ranking of all citations in a paper, with randomized order and majority-vote aggregation to produce relative impact labels.
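The mechanism can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: `label_fn` is a hypothetical stand-in for the joint LLM call, the label names "high" and "low" are invented for the example, and `toy_label_fn` fakes a position-biased model so the sketch runs without any API.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: receives citation ids in the order they are shown
# to the model and returns one impact label per id.
LabelFn = Callable[[Sequence[str]], Dict[str, str]]


def majority_labels(citations: List[str], label_fn: LabelFn,
                    runs: int = 3, seed: int = 0) -> Dict[str, str]:
    """Label every citation `runs` times in shuffled order, then majority-vote."""
    rng = random.Random(seed)
    votes: Dict[str, List[str]] = {c: [] for c in citations}
    for _ in range(runs):
        order = list(citations)
        rng.shuffle(order)        # new presentation order for each run
        labels = label_fn(order)  # one joint call covers the whole list
        for c in citations:
            votes[c].append(labels[c])
    # With two labels and an odd number of runs no tie can occur.
    return {c: Counter(v).most_common(1)[0][0] for c, v in votes.items()}


def toy_label_fn(order: Sequence[str]) -> Dict[str, str]:
    """Toy stand-in for an LLM: right about "a", but also favors slot 0."""
    return {c: "high" if c == "a" or i == 0 else "low"
            for i, c in enumerate(order)}


labels = majority_labels(["a", "b", "c", "d"], toy_label_fn)
```

Because the toy model always marks "a" as high, that label survives every shuffle; a citation marked high only because it happened to be listed first needs to land in the first slot in at least two of the three runs to keep the label.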

If this is right

Impact labels become available for every citation without extra human annotation.
Fewer LLM calls are needed per citing paper than independent scoring methods.
Open-weight models can be substituted while keeping competitive performance.
Large-scale citation databases can be labeled more cheaply for downstream tasks.
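The call-count point can be made concrete with simple arithmetic. The sketch below assumes that an independent classifier issues one model call per citation, which the abstract does not state; the function names and the reference-list length of 40 are illustrative only.

```python
def calls_independent(num_citations: int, calls_per_citation: int = 1) -> int:
    """Calls needed when each citation is scored on its own (assumed: one each)."""
    return num_citations * calls_per_citation


def calls_joint(runs: int = 3) -> int:
    """Calls needed when the whole reference list is ranked jointly `runs` times."""
    return runs


# Hypothetical citing paper with 40 references.
independent = calls_independent(40)
joint = calls_joint()
```

Under these assumptions the joint approach needs 3 calls where the independent one needs 40. Each joint prompt contains the whole reference list, so the saving in calls does not translate one-to-one into a saving in tokens.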

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Citation-based metrics such as h-index or journal impact factors could be refined by weighting references according to these relative ranks.
The same joint-ranking pattern might be tested on other ranking tasks where context spans multiple items, such as reference recommendation.
Domain-specific fine-tuning of the open-weight model could further reduce disagreement with field experts.

Load-bearing premise

That the majority vote from three randomized LLM rankings of the same citation list produces labels that match what human readers would judge as relative scholarly impact.

What would settle it

A new set of human-annotated citations in which the majority-voted LLM labels disagree with the human rankings on more than half the examples would show the method does not reliably capture relative impact.
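As a sketch of how that check could be run, the snippet below computes the disagreement rate between predicted and human labels and compares it against the one-half threshold named above. The function names and the four example labels are made up for illustration and are not data from the paper.

```python
from typing import Sequence


def disagreement_rate(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of examples where the predicted label differs from the human one."""
    if len(predicted) != len(human) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p != h for p, h in zip(predicted, human)) / len(human)


def fails_settling_test(predicted: Sequence[str], human: Sequence[str],
                        threshold: float = 0.5) -> bool:
    """True when disagreement exceeds the threshold (more than half by default)."""
    return disagreement_rate(predicted, human) > threshold


# Illustrative labels only.
predicted = ["high", "low", "low", "high"]
human = ["high", "high", "low", "low"]
rate = disagreement_rate(predicted, human)
```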

Original abstract

Assessing a cited paper's impact is typically done by analyzing its citation context in isolation within the citing paper. While this focuses on the most directly relevant text, it prevents relative comparisons across all the works a paper cites. We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting. This joint approach leverages the full citation context, rather than evaluating citations independently, to more reliably distinguish impactful references. Crystal outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a dataset of human-annotated citations. Crystal further gains efficiency through fewer LLM calls and outperforms prior baselines using an open-weight model, enabling scalable, cost-effective citation impact analysis. In a case study of ACL Test-of-Time award-winning papers, we find that Crystal's impact characterizations align closely with long-term scientific recognition. We release Crystal-Bank, a 46.8k-paper dataset with rankings and impact labels, along with code.
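For readers less familiar with the two reported metrics, here is a minimal sketch of accuracy and F1 on impact labels. It assumes a binary labeling with "high" as the positive class; the abstract does not say whether the paper's F1 is binary or macro-averaged, and the example labels are invented.

```python
from typing import Sequence


def accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Share of predictions that match the gold label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def f1_score(pred: Sequence[str], gold: Sequence[str],
             positive: str = "high") -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only.
gold = ["high", "high", "low", "low"]
pred = ["high", "low", "low", "high"]
acc = accuracy(pred, gold)
f1 = f1_score(pred, gold)
```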

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Crystal's joint LLM ranking of citations within one paper, with order randomization and majority vote, beats an isolated classifier by 9.5 points on accuracy and 8.3 points on F1, but the abstract reports neither inter-annotator agreement nor the annotation protocol behind the human labels.


Crystal's main move is to rank every citation inside a single paper at once with an LLM instead of scoring them one by one. The authors shuffle the list three times, run the prompt each time, and take the majority label to reduce positional bias. On human-annotated data this gives a 9.5-point accuracy lift and an 8.3-point F1 lift over a prior state-of-the-art isolated classifier. They also note fewer total LLM calls and competitive results with an open-weight model, and they release the rankings, labels, and code.

Referee Report

2 major / 2 minor

Summary. The paper proposes Crystal, a method to assess relative scholarly impact by jointly ranking all cited papers in a citing paper using LLMs, with three randomized-order rankings aggregated via majority voting to reduce positional bias. It claims this joint approach outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a human-annotated citation dataset, while also offering efficiency gains through fewer LLM calls and competitive results with an open-weight model. The authors release their rankings, impact labels, and codebase.

Significance. If the results hold, the work advances citation analysis by enabling relative comparisons across an entire reference list rather than isolated contexts, potentially improving the reliability of impact characterization in bibliometrics. The open release of data and code is a clear strength that supports reproducibility and extension by the community. Efficiency improvements could facilitate scalable applications in large scholarly corpora.

major comments (2)

[Abstract] The central performance claim (+9.5% accuracy, +8.3% F1 over prior SOTA) is presented without any information on dataset size, number of annotators, annotation guidelines, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), or statistical significance tests. These details are load-bearing for evaluating whether the human labels constitute a stable ground truth for relative impact.
[Abstract / Evaluation section] The majority-voting aggregation over three randomized rankings is described at a high level but lacks an ablation or analysis showing that three runs suffice to mitigate LLM positional bias, or how ties in the vote are resolved; without this, the reliability of the joint-ranking advantage over independent classification remains unverified.
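To make the agreement statistic requested in the first comment concrete, the sketch below computes Cohen's kappa for two annotators: observed agreement p_o, chance agreement p_e from each annotator's label frequencies, and kappa = (p_o - p_e) / (1 - p_e). The annotator labels are invented for illustration and say nothing about the paper's actual annotation data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustrative annotations only.
annotator_1 = ["high", "high", "low", "low"]
annotator_2 = ["high", "low", "low", "low"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5.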

minor comments (2)

[Abstract] The abstract would be strengthened by a one-sentence summary of dataset characteristics (e.g., number of citing papers and total citations annotated).
Consider adding a limitations paragraph discussing potential LLM-specific artifacts (e.g., hallucinated impact judgments) and how they were mitigated beyond randomization.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and describe the revisions we will make to improve clarity and rigor.


Referee: [Abstract] The central performance claim (+9.5% accuracy, +8.3% F1 over prior SOTA) is presented without any information on dataset size, number of annotators, annotation guidelines, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), or statistical significance tests. These details are load-bearing for evaluating whether the human labels constitute a stable ground truth for relative impact.

Authors: We agree that the abstract should include key details on the evaluation dataset to allow readers to assess ground-truth stability. The Evaluation section of the manuscript already reports the dataset size, number of annotators, annotation guidelines, inter-annotator agreement, and statistical significance tests. We will revise the abstract to concisely state the dataset size, number of annotators, and inter-annotator agreement while directing readers to the full details and significance results in the main text. revision: yes

Referee: [Abstract / Evaluation section] The majority-voting aggregation over three randomized rankings is described at a high level but lacks an ablation or analysis showing that three runs suffice to mitigate LLM positional bias, or how ties in the vote are resolved; without this, the reliability of the joint-ranking advantage over independent classification remains unverified.

Authors: We acknowledge that an explicit ablation would strengthen the justification for using three randomized rankings. In the revised manuscript we will add an ablation study in the Evaluation section comparing performance and bias reduction across one, three, and five runs. We will also explicitly describe the tie-resolution rule (random selection among tied labels) in the Methods section. These additions will provide direct evidence for the reliability of the majority-voting procedure. revision: yes
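The tie-resolution rule named in this response (random selection among tied labels) could look like the following sketch. It is an illustration only: the function name, the seeded generator, and the three-way label set including "medium" are assumptions, since the simulated rebuttal gives no implementation details.

```python
import random
from collections import Counter
from typing import Sequence


def majority_with_random_tiebreak(votes: Sequence[str],
                                  rng: random.Random) -> str:
    """Return the most frequent label; pick at random among tied labels."""
    counts = Counter(votes)
    top = max(counts.values())
    # Sort so the candidate order does not depend on vote order.
    tied = sorted(label for label, c in counts.items() if c == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)


rng = random.Random(0)  # seeded so repeated runs are reproducible
clear_winner = majority_with_random_tiebreak(["high", "low", "high"], rng)
three_way_tie = majority_with_random_tiebreak(["high", "medium", "low"], rng)
```

With two labels and an odd number of runs the tie branch is never taken; it matters only when there are more than two labels or an even number of runs.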

Circularity Check

0 steps flagged

No circularity: empirical LLM ranking method evaluated on external human annotations


The paper describes an empirical approach (Crystal) that uses LLMs for joint ranking of cited papers within a citing paper, with randomized ordering and majority-vote aggregation to reduce positional bias. Performance is measured by accuracy and F1 improvement over a prior classifier on a held-out dataset of human-annotated citations. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the abstract or described method. The central claim rests on comparison to independent external labels rather than any internal reduction to the method's own inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that current LLMs can perform reliable comparative impact judgments when given joint citation context; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: LLMs can be prompted to assess relative scholarly impact reliably when given joint context.
The method assumes this capability of LLMs without further justification or validation details in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

"We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.