ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Smith, Daniel Khashabi, and Hannaneh Hajishirzi
38 Pith papers cite this work.
representative citing papers
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.
HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve gains of up to 8.9 points with roughly 7 times fewer training examples.
EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.
SuperMemory-VQA provides 4,853 human-verified QA pairs from 52.9 hours of egocentric AI glasses recordings to benchmark AI systems on realistic long-horizon memory tasks including an unanswerable option.
Formalizes continual model routing (CMR), releases CMRBench with over 2,000 models, and presents CARvE, which outperforms retrieval, fine-tuning, and adapter-merging baselines on model, family, and domain accuracy.
TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.
Builds a 2M-contribution graph from 230k papers with 12.5M prerequisite links and reports 0.48 MAP on temporal backtesting for predicting enabling technologies.
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
SRA reframes CTKD by aligning attention-weighted span centers of mass in a multi-particle system model with geometric regularization and span logit distillation, claiming consistent outperformance over baselines.
Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.
CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.
SWE-RL uses RL on software evolution data to train LLMs, achieving 41% on SWE-bench Verified and generalizing to other reasoning tasks.
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
Context-aware distillation with BNF+API+vocabulary scales PolkitBench to 10,073 pairs at 99.7% runtime pass rate; ablation on GigaChat-10B shows vocabulary adds +0.198 combined score while API/BNF add 22-25pp structural validity.
K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.
Multi-response training retains multiple responses per prompt to reduce uncertainty about the conditional output distribution, yielding improved distributional generalization especially in high response-diversity and low prompt-redundancy regimes.
SynLearner lets LLMs improve synthetic data generation on later tasks in a stream by learning reusable patterns and balancing quality with diversity from feedback on earlier tasks.
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
citing papers explorer
A Survey on LLM-as-a-Judge
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.