Refining and Reusing Annotation Guidelines for LLM Annotation

Akiko Aizawa; Jin-Dong Kim; Kon Woo Kim

arxiv: 2605.20809 · v1 · pith:EWBKCBR4new · submitted 2026-05-20 · 💻 cs.CL

Refining and Reusing Annotation Guidelines for LLM Annotation

Kon Woo Kim , Jin-Dong Kim , Akiko Aizawa This is my paper

Pith reviewed 2026-05-21 05:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords annotation guidelinesLLM annotationbiomedical NERiterative moderationguideline refinementnamed entity recognitionzero-shot annotation

0 comments

The pith

An iterative moderation framework refines annotation guidelines to align LLMs with specialized biomedical conventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes reusing and refining annotation guidelines through an iterative moderation process to help large language models handle the specific rules of gold-standard benchmarks. This framework mimics the initial stages of human annotation projects by having models generate annotations, receive feedback, and update guidelines repeatedly. Experiments on biomedical named entity recognition tasks from NCBI Disease, BC5CDR, and BioRED, using GPT, Gemini, and DeepSeek models, test three ideas: guideline integration improves results, reasoning-optimized models do better, and the process works with little supervision. Results support all three ideas while noting that further improvements are possible.

Core claim

The authors establish that systematically refining annotation guidelines via an iterative moderation framework improves how well LLMs match expert annotation standards in biomedical named entity recognition, with evidence that incorporating guidelines helps, reasoning-focused models have an edge, and the moderation succeeds even when supervision is minimal.

What carries the argument

The iterative moderation framework that simulates early annotation project phases by cycling between LLM annotation, moderator review, and guideline updates.

If this is right

Guideline integration leads to better alignment with gold-standard benchmarks in biomedical NER.
Reasoning optimized LLMs show performance advantages in this setup.
Moderation under minimal supervision proves viable across tested tasks and model families.
The framework has potential for refining guidelines effectively in iterative cycles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar refinement loops could help LLMs adapt to conventions in other specialized domains beyond biomedicine.
Integrating this with existing LLM fine-tuning methods might further reduce reliance on manual annotations.
Testing the framework on larger or more diverse LLM families could reveal scalability limits.

Load-bearing premise

That the chosen biomedical NER tasks and the three LLM families represent the wider difficulties LLMs encounter with specialized gold-standard annotation rules.

What would settle it

Running the same iterative moderation on a different domain-specific annotation task, such as legal entity recognition, and observing no accuracy gains compared to baseline zero-shot prompting would challenge the central claim.

Figures

Figures reproduced from arXiv: 2605.20809 by Akiko Aizawa, Jin-Dong Kim, Kon Woo Kim.

**Figure 2.** Figure 2: Example of LLM-simulated moderation for DiseaseClass guideline refinement. The moderator identifies [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Discrepancy matrices over moderation iterations on NCBI Disease. Same-label cells indicate span [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Example of correctly and falsely changed [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Prompt template and Human guideline. (Left) Prompt template used for LLM-based annotation. (Right) [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Inserted guideline rule after moderation. Full text of the newly added rule (§4) that instructs annotating [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Detailed before/after examples across moderation iterations. Top row: a coordinated [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

read the original abstract

While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows good potential in effectively refining guidelines, our analysis also reveals substantial room for improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper tests an iterative moderation loop to refine guidelines for LLM annotation on biomedical NER and finds it helps alignment, but the gains may come mostly from the guidelines themselves rather than the iteration.

read the letter

The core takeaway is that adding and iteratively moderating annotation guidelines improves how well LLMs match gold-standard labels on tasks like NCBI Disease, BC5CDR, and BioRED. The tests across GPT, Gemini, and DeepSeek families back the three hypotheses on guideline value, reasoning model advantages, and light-supervision viability, with the authors noting clear remaining gaps in performance.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes an iterative moderation framework to refine and reuse annotation guidelines as a mechanism for aligning LLMs with the specialized conventions of gold-standard benchmarks. It evaluates three hypotheses—efficacy of guideline integration, advantage of reasoning-optimized models, and viability of moderation under minimal supervision—across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) using three LLM families (GPT, Gemini, DeepSeek), reporting empirical confirmation of all hypotheses while noting room for improvement.

Significance. If the gains can be shown to arise specifically from the iterative moderation process rather than from guideline inclusion alone, the work would supply a practical, low-supervision method for adapting LLMs to domain-specific annotation standards. The multi-model, multi-dataset design provides a reasonable test bed for the biomedical NER setting and could generalize to other specialized annotation scenarios.

major comments (1)

[Experimental Evaluation] The experimental design lacks a non-iterative control condition in which the same annotation guidelines are simply concatenated into the prompt without any human-moderated refinement steps. Without this baseline, performance gains on the NCBI Disease, BC5CDR, and BioRED tasks cannot be attributed specifically to the iterative framework, weakening support for hypotheses (1) and (3).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback on experimental controls is particularly helpful for clarifying the contribution of the iterative moderation framework. We address the major comment below and outline the planned revisions.

read point-by-point responses

Referee: [Experimental Evaluation] The experimental design lacks a non-iterative control condition in which the same annotation guidelines are simply concatenated into the prompt without any human-moderated refinement steps. Without this baseline, performance gains on the NCBI Disease, BC5CDR, and BioRED tasks cannot be attributed specifically to the iterative framework, weakening support for hypotheses (1) and (3).

Authors: We agree that the current design does not fully isolate the contribution of the iterative refinement steps from the simple inclusion of guidelines. Hypothesis (1) concerns the efficacy of guideline integration in general, while hypothesis (3) specifically addresses the viability of moderation under minimal supervision; a non-iterative concatenation baseline would indeed help attribute gains more precisely to the iterative process. In the revised manuscript we will add this control condition across the three datasets and model families, reporting the performance difference between (a) direct concatenation of the initial guidelines and (b) the iteratively moderated guidelines. This addition will strengthen the empirical support for both hypotheses without altering the existing results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on multi-model empirical tests

full rationale

The paper introduces an iterative moderation framework and evaluates three hypotheses via experiments on NCBI Disease, BC5CDR, and BioRED tasks using GPT, Gemini, and DeepSeek model families. All central claims are grounded in reported performance outcomes rather than any self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are present that reduce the results to their own inputs by construction. The validation is self-contained through direct testing across tasks and models, with no derivation chain that collapses into prior assumptions or renamings of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies primarily on empirical testing of an iterative process rather than introducing new mathematical axioms or fitted parameters; the core premise is that guideline refinement through moderation improves alignment.

axioms (1)

domain assumption LLMs struggle with specialized annotation conventions but can be aligned via guideline integration and iterative moderation.
This assumption underpins the three hypotheses tested in the evaluation.

pith-pipeline@v0.9.0 · 5654 in / 971 out tokens · 32027 ms · 2026-05-21T05:06:42.806593+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

H1: The incorporation of annotation guidelines into the prompt will improve the LLM-based annotations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 2 internal anchors

[1]

International Conference on Applications of Natural Language to Information Systems , pages=

Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study , author=. International Conference on Applications of Natural Language to Information Systems , pages=. 2025 , organization=

work page 2025
[2]

Journal of Biomedical Informatics , volume =

Islamaj Do. Journal of Biomedical Informatics , volume =. 2014 , doi =

work page 2014
[3]

Nucleic acids research , volume=

TeamTat: a collaborative text annotation tool , author=. Nucleic acids research , volume=. 2020 , publisher=

work page 2020
[4]

and Sciaky, Daniela and Wei, Chih-Hsuan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J

Li, Jiao and Sun, Yueping and Johnson, Robin J. and Sciaky, Daniela and Wei, Chih-Hsuan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J. and Wiegers, Thomas C. and Lu, Zhiyong , journal =. 2016 , doi =

work page 2016
[5]

Briefings in Bioinformatics , volume=

BioRED: a rich biomedical relation extraction dataset , author=. Briefings in Bioinformatics , volume=. 2022 , publisher=

work page 2022
[6]

Proceedings of the 2009 conference on empirical methods in natural language processing , pages=

Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk , author=. Proceedings of the 2009 conference on empirical methods in natural language processing , pages=

work page 2009
[7]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Is GPT-3 a good data annotator? , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[8]

arXiv preprint arXiv:2108.13487 , year=

Want to reduce labeling cost? GPT-3 can help , author=. arXiv preprint arXiv:2108.13487 , year=

work page arXiv
[9]

Proceedings of the National Academy of Sciences of the United States of America , volume =

ChatGPT outperforms crowd workers for text-annotation tasks , author =. Proceedings of the National Academy of Sciences of the United States of America , volume =. 2023 , doi =

work page 2023
[10]

arXiv preprint arXiv:2304.10145 , year=

Can chatgpt reproduce human-generated labels? a study of social computing tasks , author=. arXiv preprint arXiv:2304.10145 , year=

work page arXiv
[11]

Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Fonseca, Marcio and Cohen, Shay. Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.478

work page doi:10.18653/v1/2024.findings-acl.478 2024
[12]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

GuideNER: Annotation Guidelines Are Better than Examples for In-Context Named Entity Recognition , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[13]

arXiv preprint arXiv:2502.16377 , year=

Instruction-tuning llms for event extraction with annotation guidelines , author=. arXiv preprint arXiv:2502.16377 , year=

work page arXiv
[14]

and Ross, Mike and Minton, Steven N

Bibal, Adrien and Gerlek, Nathaniel and Muric, Goran and Boschee, Elizabeth and Fincke, Steven C. and Ross, Mike and Minton, Steven N. Automating Annotation Guideline Improvements using LLM s: A Case Study. Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation. 2025

work page 2025
[15]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , year =. Judging. 2306.05685 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

arXiv preprint arXiv:2310.17631 , year=

Judgelm: Fine-tuned large language models are scalable judges , author=. arXiv preprint arXiv:2310.17631 , year=

work page arXiv
[17]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

G-eval: NLG evaluation using gpt-4 with better human alignment , author=. arXiv preprint arXiv:2303.16634 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Bioinformatics , volume=

Open Agile text mining for bioinformatics: the PubAnnotation ecosystem , author=. Bioinformatics , volume=. 2019 , publisher=

work page 2019
[19]

Bioinformatics , volume=

DNorm: disease name normalization with pairwise learning to rank , author=. Bioinformatics , volume=. 2013 , publisher=

work page 2013
[20]

large language models: Evaluating annotation scheme and guidelines development for clinical narratives , author=

Human experts vs. large language models: Evaluating annotation scheme and guidelines development for clinical narratives , author=. Proceedings of the Text2Story , pages=

work page
[21]

Named entity recognition with small strongly labeled and large weakly labeled data , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

work page

[1] [1]

International Conference on Applications of Natural Language to Information Systems , pages=

Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study , author=. International Conference on Applications of Natural Language to Information Systems , pages=. 2025 , organization=

work page 2025

[2] [2]

Journal of Biomedical Informatics , volume =

Islamaj Do. Journal of Biomedical Informatics , volume =. 2014 , doi =

work page 2014

[3] [3]

Nucleic acids research , volume=

TeamTat: a collaborative text annotation tool , author=. Nucleic acids research , volume=. 2020 , publisher=

work page 2020

[4] [4]

and Sciaky, Daniela and Wei, Chih-Hsuan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J

Li, Jiao and Sun, Yueping and Johnson, Robin J. and Sciaky, Daniela and Wei, Chih-Hsuan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J. and Wiegers, Thomas C. and Lu, Zhiyong , journal =. 2016 , doi =

work page 2016

[5] [5]

Briefings in Bioinformatics , volume=

BioRED: a rich biomedical relation extraction dataset , author=. Briefings in Bioinformatics , volume=. 2022 , publisher=

work page 2022

[6] [6]

Proceedings of the 2009 conference on empirical methods in natural language processing , pages=

Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk , author=. Proceedings of the 2009 conference on empirical methods in natural language processing , pages=

work page 2009

[7] [7]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Is GPT-3 a good data annotator? , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[8] [8]

arXiv preprint arXiv:2108.13487 , year=

Want to reduce labeling cost? GPT-3 can help , author=. arXiv preprint arXiv:2108.13487 , year=

work page arXiv

[9] [9]

Proceedings of the National Academy of Sciences of the United States of America , volume =

ChatGPT outperforms crowd workers for text-annotation tasks , author =. Proceedings of the National Academy of Sciences of the United States of America , volume =. 2023 , doi =

work page 2023

[10] [10]

arXiv preprint arXiv:2304.10145 , year=

Can chatgpt reproduce human-generated labels? a study of social computing tasks , author=. arXiv preprint arXiv:2304.10145 , year=

work page arXiv

[11] [11]

Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Fonseca, Marcio and Cohen, Shay. Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.478

work page doi:10.18653/v1/2024.findings-acl.478 2024

[12] [12]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

GuideNER: Annotation Guidelines Are Better than Examples for In-Context Named Entity Recognition , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[13] [13]

arXiv preprint arXiv:2502.16377 , year=

Instruction-tuning llms for event extraction with annotation guidelines , author=. arXiv preprint arXiv:2502.16377 , year=

work page arXiv

[14] [14]

and Ross, Mike and Minton, Steven N

Bibal, Adrien and Gerlek, Nathaniel and Muric, Goran and Boschee, Elizabeth and Fincke, Steven C. and Ross, Mike and Minton, Steven N. Automating Annotation Guideline Improvements using LLM s: A Case Study. Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation. 2025

work page 2025

[15] [15]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , year =. Judging. 2306.05685 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

arXiv preprint arXiv:2310.17631 , year=

Judgelm: Fine-tuned large language models are scalable judges , author=. arXiv preprint arXiv:2310.17631 , year=

work page arXiv

[17] [17]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

G-eval: NLG evaluation using gpt-4 with better human alignment , author=. arXiv preprint arXiv:2303.16634 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Bioinformatics , volume=

Open Agile text mining for bioinformatics: the PubAnnotation ecosystem , author=. Bioinformatics , volume=. 2019 , publisher=

work page 2019

[19] [19]

Bioinformatics , volume=

DNorm: disease name normalization with pairwise learning to rank , author=. Bioinformatics , volume=. 2013 , publisher=

work page 2013

[20] [20]

large language models: Evaluating annotation scheme and guidelines development for clinical narratives , author=

Human experts vs. large language models: Evaluating annotation scheme and guidelines development for clinical narratives , author=. Proceedings of the Text2Story , pages=

work page

[21] [21]

Named entity recognition with small strongly labeled and large weakly labeled data , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

work page