hub

Smith, Daniel Khashabi, and Hannaneh Hajishirzi

Wang, Y · 2023 · DOI 10.18653/v1/2023.acl-long.754

38 Pith papers cite this work. Polarity classification is still indexing.

38 Pith papers citing it

open at publisher browse 38 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

SuperMemory-VQA provides 4,853 human-verified QA pairs from 52.9 hours of egocentric AI glasses recordings to benchmark AI systems on realistic long-horizon memory tasks including an unanswerable option.

Continual Model Routing in Evolving Model Hubs

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Formalizes continual model routing (CMR), releases CMRBench with over 2000 models, and presents CARvE which outperforms retrieval, fine-tuning and adapter-merging baselines on model/family/domain accuracy.

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

Builds a 2M-contribution graph from 230k papers with 12.5M prerequisite links and reports 0.48 MAP on temporal backtesting for predicting enabling technologies.

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.

SRA: Span Representation Alignment for Large Language Model Distillation

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

SRA reframes CTKD by aligning attention-weighted span centers of mass in a multi-particle system model with geometric regularization and span logit distillation, claiming consistent outperformance over baselines.

Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Context-Aware Distillation and Ablation for Text2DSL

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

Context-aware distillation with BNF+API+vocabulary scales PolkitBench to 10,073 pairs at 99.7% runtime pass rate; ablation on GigaChat-10B shows vocabulary adds +0.198 combined score while API/BNF add 22-25pp structural validity.

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

Multi-response training retains multiple responses per prompt to reduce uncertainty about the conditional output distribution, yielding improved distributional generalization especially in high response-diversity and low prompt-redundancy regimes.

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

SynLearner lets LLMs improve synthetic data generation on later tasks in a stream by learning reusable patterns and balancing quality with diversity from feedback on earlier tasks.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Distribution Corrected Offline Data Distillation for Large Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

cs.CL · 2025-10-04 · unverdicted · novelty 6.0

Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

cs.LG · 2025-09-17 · unverdicted · novelty 6.0

Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.

citing papers explorer

Showing 1 of 1 citing paper after filters.

A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 165
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

Smith, Daniel Khashabi, and Hannaneh Hajishirzi

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer