Crystal: Characterizing Relative Impact of Scholarly Publications
Pith reviewed 2026-05-15 00:56 UTC · model grok-4.3
The pith
Crystal improves citation impact classification by jointly ranking all references with LLMs and majority voting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Crystal shows that prompting an LLM to produce a single ranked list of all cited papers within one citing paper, repeated three times with shuffled order and aggregated by majority vote, yields impact labels that align more closely with human judgments than classifiers that score each citation independently.
What carries the argument
Joint LLM ranking of all citations in a paper, with randomized order and majority-vote aggregation to produce relative impact labels.
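The mechanism above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `rank_fn` is a hypothetical stand-in for the LLM call, the label names `impactful` and `incidental` are invented here, and the rule that the `top_k` highest-ranked items receive the positive label is an assumption, since the source does not say how labels are derived from ranks.

```python
import random
from collections import Counter
from typing import Callable, Dict, List


def labels_from_ranking(ranking: List[str], top_k: int) -> Dict[str, str]:
    # Assumed rule (not from the source): the first top_k items are "impactful".
    return {
        cid: ("impactful" if pos < top_k else "incidental")
        for pos, cid in enumerate(ranking)
    }


def joint_rank_with_voting(
    citations: List[str],
    rank_fn: Callable[[List[str]], List[str]],  # hypothetical LLM wrapper
    runs: int = 3,
    top_k: int = 2,
    seed: int = 0,
) -> Dict[str, str]:
    rng = random.Random(seed)
    votes = {cid: Counter() for cid in citations}
    for _ in range(runs):
        shuffled = list(citations)
        rng.shuffle(shuffled)          # randomize presentation order each run
        ranking = rank_fn(shuffled)    # one joint ranking of the whole list
        for cid, label in labels_from_ranking(ranking, top_k).items():
            votes[cid][label] += 1
    # Two labels and an odd number of runs cannot tie, so the top count is unique.
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}


# Toy ranker used only for demonstration: it sorts by a fixed hidden score
# and ignores presentation order, unlike a real LLM.
toy_scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.7}


def toy_rank_fn(items: List[str]) -> List[str]:
    return sorted(items, key=lambda cid: -toy_scores[cid])


result = joint_rank_with_voting(list(toy_scores), toy_rank_fn)
print(result)
```

Note that each run issues a single ranking call for the whole reference list, which is where the claimed reduction in LLM calls would come from.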
If this is right
- Impact labels become available for every citation without extra human annotation.
- Fewer LLM calls are needed per citing paper than with methods that score each citation independently.
- An open-weight model can be substituted while still outperforming prior baselines.
- Large-scale citation databases can be labeled more cheaply for downstream tasks.
Where Pith is reading between the lines
- Citation-based metrics such as h-index or journal impact factors could be refined by weighting references according to these relative ranks.
- The same joint-ranking pattern might be tested on other ranking tasks where context spans multiple items, such as reference recommendation.
- Domain-specific fine-tuning of the open-weight model could further reduce disagreement with field experts.
Load-bearing premise
That the majority vote from three randomized LLM rankings of the same citation list produces labels that match what human readers would judge as relative scholarly impact.
What would settle it
A new set of human-annotated citations in which the majority-voted LLM labels disagree with the human rankings on more than half the examples would show the method does not reliably capture relative impact.
Original abstract
Assessing a cited paper's impact is typically done by analyzing its citation context in isolation within the citing paper. While this focuses on the most directly relevant text, it prevents relative comparisons across all the works a paper cites. We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting. This joint approach leverages the full citation context, rather than evaluating citations independently, to more reliably distinguish impactful references. Crystal outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a dataset of human-annotated citations. Crystal further gains efficiency through fewer LLM calls and outperforms prior baselines using an open-weight model, enabling scalable, cost-effective citation impact analysis. In a case study of ACL Test-of-Time award-winning papers, we find that Crystal's impact characterizations align closely with long-term scientific recognition. We release Crystal-Bank, a 46.8k-paper dataset with rankings and impact labels, along with code.
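For reference, the two reported metrics can be computed as in the sketch below. It assumes a binary label set and computes F1 for a single positive class; the abstract does not state whether the reported F1 is binary or macro-averaged, so that choice is an assumption, as are the function names and the toy labels.

```python
from typing import Hashable, Sequence


def accuracy(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def f1_score(pred: Sequence[Hashable], gold: Sequence[Hashable], positive: Hashable) -> float:
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = impactful, 0 = not), invented for illustration only.
gold_labels = [1, 1, 0, 0, 1]
pred_labels = [1, 0, 0, 1, 1]
print(accuracy(pred_labels, gold_labels), f1_score(pred_labels, gold_labels, positive=1))
```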
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Crystal, a method to assess relative scholarly impact by jointly ranking all cited papers in a citing paper using LLMs, with three randomized-order rankings aggregated via majority voting to reduce positional bias. It claims this joint approach outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a human-annotated citation dataset, while also offering efficiency gains through fewer LLM calls and outperforming prior baselines with an open-weight model. The authors release their rankings, impact labels, and codebase.
Significance. If the results hold, the work advances citation analysis by enabling relative comparisons across an entire reference list rather than isolated contexts, potentially improving the reliability of impact characterization in bibliometrics. The open release of data and code is a clear strength that supports reproducibility and extension by the community. Efficiency improvements could facilitate scalable applications in large scholarly corpora.
major comments (2)
- [Abstract] The central performance claim (+9.5% accuracy, +8.3% F1 over prior SOTA) is presented without any information on dataset size, number of annotators, annotation guidelines, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), or statistical significance tests. These details are load-bearing for evaluating whether the human labels constitute a stable ground truth for relative impact.
- [Abstract / Evaluation section] The majority-voting aggregation over three randomized rankings is described at a high level but lacks an ablation or analysis showing that three runs suffice to mitigate LLM positional bias, or how ties in the vote are resolved; without this, the reliability of the joint-ranking advantage over independent classification remains unverified.
minor comments (2)
- [Abstract] The abstract would be strengthened by a one-sentence summary of dataset characteristics (e.g., number of citing papers and total citations annotated).
- Consider adding a limitations paragraph discussing potential LLM-specific artifacts (e.g., hallucinated impact judgments) and how they were mitigated beyond randomization.
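The inter-annotator agreement statistic requested in the first major comment can be illustrated with Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. The sketch below is a plain-Python illustration with invented toy labels; it says nothing about the agreement actually obtained on the paper's dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy annotations (1 = impactful, 0 = not), invented for illustration only.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```

Fleiss' kappa generalizes the same idea to more than two annotators and would be the natural choice if each citation was labeled by three or more people.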
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and describe the revisions we will make to improve clarity and rigor.
- Referee: [Abstract] The central performance claim (+9.5% accuracy, +8.3% F1 over prior SOTA) is presented without any information on dataset size, number of annotators, annotation guidelines, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), or statistical significance tests. These details are load-bearing for evaluating whether the human labels constitute a stable ground truth for relative impact.
  Authors: We agree that the abstract should include key details on the evaluation dataset to allow readers to assess ground-truth stability. The Evaluation section of the manuscript already reports the dataset size, number of annotators, annotation guidelines, inter-annotator agreement, and statistical significance tests. We will revise the abstract to concisely state the dataset size, number of annotators, and inter-annotator agreement while directing readers to the full details and significance results in the main text.
  Revision: yes
- Referee: [Abstract / Evaluation section] The majority-voting aggregation over three randomized rankings is described at a high level but lacks an ablation or analysis showing that three runs suffice to mitigate LLM positional bias, or how ties in the vote are resolved; without this, the reliability of the joint-ranking advantage over independent classification remains unverified.
  Authors: We acknowledge that an explicit ablation would strengthen the justification for using three randomized rankings. In the revised manuscript we will add an ablation study in the Evaluation section comparing performance and bias reduction across one, three, and five runs. We will also explicitly describe the tie-resolution rule (random selection among tied labels) in the Methods section. These additions will provide direct evidence for the reliability of the majority-voting procedure.
  Revision: yes
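The tie-resolution rule mentioned in the rebuttal (random selection among tied labels) could look like the sketch below. The function name, the seeded generator, and the three-level label set are assumptions made for illustration; the source does not specify the label inventory. With three votes, a tie for the top count can only arise when all three votes differ, which requires at least three distinct labels.

```python
import random
from collections import Counter
from typing import List


def majority_vote(labels: List[str], rng: random.Random) -> str:
    counts = Counter(labels)
    best = max(counts.values())
    # Sort the tied labels so the outcome depends only on the RNG state,
    # not on the order in which the votes arrived.
    tied = sorted(label for label, count in counts.items() if count == best)
    if len(tied) == 1:
        return tied[0]
    return rng.choice(tied)  # assumed rule: uniform random pick among ties


rng = random.Random(0)
clear_winner = majority_vote(["high", "low", "high"], rng)
three_way_tie = majority_vote(["high", "medium", "low"], rng)
print(clear_winner, three_way_tie)
```

Passing an explicit seeded `random.Random` keeps the tie-break reproducible across reruns, which matters if the resulting labels are released as a dataset.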
Circularity Check
No circularity: empirical LLM ranking method evaluated on external human annotations
The paper describes an empirical approach (Crystal) that uses LLMs for joint ranking of cited papers within a citing paper, with randomized ordering and majority-vote aggregation to reduce positional bias. Performance is measured by accuracy and F1 improvement over a prior classifier on a held-out dataset of human-annotated citations. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the abstract or described method. The central claim rests on comparison to independent external labels rather than any internal reduction to the method's own inputs or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be prompted to assess relative scholarly impact reliably when given joint context
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.