pith. machine review for the scientific record.

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response, reusing the multiple LLMs at hand; for reasoning, we aggregate the multiple scores into a final score per response, either with a straightforward averaging strategy or with a principled graphical-model-based truth inference algorithm; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Across four datasets spanning diverse task types, including factual recall QA, math reasoning, and instruction following, the two variants of the proposed approach outperform the advanced baseline Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
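As a concrete illustration of the three-stage pipeline described in the abstract, here is a minimal sketch in Python. The `generate(model, query)` and `judge_score(judge, query, answer)` callables are hypothetical placeholders for an LLM API, and this version uses the simple score-averaging variant of the reasoning stage rather than the paper's graphical model.

```python
# Minimal sketch of the three-stage select-best ensemble (averaging variant).
# `generate` and `judge_score` are assumed wrappers around an LLM API.
from statistics import mean

def peer_review_select(query, models, generate, judge_score):
    """Return (best_answer, its_average_peer_score)."""
    # Stage 1: every model drafts a candidate answer for the query.
    candidates = {m: generate(m, query) for m in models}

    # Stage 2a (scoring): every model judges every candidate (LLM-as-a-Judge).
    # Stage 2b (reasoning): aggregate each candidate's scores by averaging.
    avg_score = {
        m: mean(judge_score(j, query, ans) for j in models)
        for m, ans in candidates.items()
    }

    # Stage 3 (selection): the highest-scoring candidate is the ensemble output.
    best = max(avg_score, key=avg_score.get)
    return candidates[best], avg_score[best]
```

The second variant replaces the plain average with truth inference. The sketch below uses a CRH-style iterative scheme as an assumed stand-in, since the abstract does not spell out the paper's exact graphical model: it alternates between estimating a weighted consensus score per candidate and re-weighting judges by how closely they track that consensus.

```python
import numpy as np

def truth_inference(scores: np.ndarray, iters: int = 10) -> np.ndarray:
    """CRH-style aggregation. `scores[j, c]` is judge j's score for
    candidate c; returns one consensus score per candidate."""
    weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    for _ in range(iters):
        truth = weights @ scores / weights.sum()           # weighted consensus
        errs = ((scores - truth) ** 2).sum(axis=1) + 1e-9  # each judge's deviation
        weights = -np.log(errs / errs.sum())               # closer judges weigh more
    return truth

# Selection then mirrors stage 3: pick the candidate with the highest
# consensus score, e.g. best_idx = int(np.argmax(truth_inference(score_matrix))).
```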

fields

cs.LG 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Policy Improvement Reinforcement Learning

cs.LG · 2026-04-01 · unverdicted · novelty 6.0

PIRL maximizes cumulative policy improvement across iterations rather than surrogate rewards and is provably aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
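One hedged reading of why such an objective aligns with final performance (an inference from this summary, not the citing paper's own proof): if $J(\pi_t)$ denotes the performance of the policy after iteration $t$, the cumulative improvement telescopes,

$$\sum_{t=0}^{T-1}\left[J(\pi_{t+1}) - J(\pi_t)\right] = J(\pi_T) - J(\pi_0),$$

so maximizing summed per-iteration improvement is, up to the fixed initial term, maximizing the final policy's performance.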

citing papers explorer

Showing 1 of 1 citing paper.

  • Policy Improvement Reinforcement Learning cs.LG · 2026-04-01 · unverdicted · none · ref 7
