hub

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi · 2023 · DOI 10.18653/v1/2023.acl-long.754

33 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

SuperMemory-VQA provides 4,853 human-verified QA pairs from 52.9 hours of egocentric AI glasses recordings to benchmark AI systems on realistic long-horizon memory tasks including an unanswerable option.

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.

Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Context-Aware Distillation and Ablation for Text2DSL

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

Context-aware distillation with BNF+API+vocabulary scales PolkitBench to 10,073 pairs at 99.7% runtime pass rate; ablation on GigaChat-10B shows vocabulary adds +0.198 combined score while API/BNF add 22-25pp structural validity.

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

Multi-response training retains multiple responses per prompt to reduce uncertainty about the conditional output distribution, yielding improved distributional generalization especially in high response-diversity and low prompt-redundancy regimes.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Distribution Corrected Offline Data Distillation for Large Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

cs.CL · 2025-10-04 · unverdicted · novelty 6.0

Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

cs.LG · 2025-09-17 · unverdicted · novelty 6.0

Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Robots Need More than VLA and World Models

cs.RO · 2026-06-04 · unverdicted · novelty 5.0

The paper identifies four missing interfaces (data autolabelling, embodiment retargeting, physics-grounded world models, and video-based reward inference) as the central bottleneck beyond VLA scaling for robot intelligence.

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.

torchtune: PyTorch native post-training library

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.

High-quality generation of dynamic game content via small language models: A proof of concept

cs.AI · 2026-01-30 · conditional · novelty 5.0

Proof-of-concept shows fine-tuned small language models achieve adequate quality for real-time game content generation in a scoped RPG loop via retry-until-success and LLM-as-judge evaluation.

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

cs.CL · 2024-08-28 · unverdicted · novelty 5.0

WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.

citing papers explorer

Showing 33 of 33 citing papers.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL · 2026-04-19 · unverdicted · none · ref 74
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning
cs.SE · 2026-06-18 · unverdicted · none · ref 67
Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

HARP: Efficient Data Selection for Finetuning Large Language Models
cs.LG · 2026-06-05 · unverdicted · none · ref 9
HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision
cs.CL · 2026-06-01 · unverdicted · none · ref 20
EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
cs.CV · 2026-05-30 · unverdicted · none · ref 59
SuperMemory-VQA provides 4,853 human-verified QA pairs from 52.9 hours of egocentric AI glasses recordings to benchmark AI systems on realistic long-horizon memory tasks including an unanswerable option.

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
cs.CL · 2026-05-13 · unverdicted · none · ref 55
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.

Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning
cs.CL · 2026-04-14 · unverdicted · none · ref 46
Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
cs.CL · 2025-02-28 · unverdicted · none · ref 128
CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
cs.SE · 2025-02-25 · unverdicted · none · ref 103
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Self-Rewarding Language Models
cs.CL · 2024-01-18 · conditional · none · ref 119
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Context-Aware Distillation and Ablation for Text2DSL
cs.CL · 2026-06-21 · unverdicted · none · ref 25
Context-aware distillation with BNF+API+vocabulary scales PolkitBench to 10,073 pairs at 99.7% runtime pass rate; ablation on GigaChat-10B shows vocabulary adds +0.198 combined score while API/BNF add 22-25pp structural validity.

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
cs.CL · 2026-06-01 · unverdicted · none · ref 32
K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization
cs.LG · 2026-05-30 · unverdicted · none · ref 41
Multi-response training retains multiple responses per prompt to reduce uncertainty about the conditional output distribution, yielding improved distributional generalization especially in high response-diversity and low prompt-redundancy regimes.

Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG · 2026-05-17 · unverdicted · none · ref 11
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL · 2026-05-14 · unverdicted · none · ref 178
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Distribution Corrected Offline Data Distillation for Large Language Models
cs.CL · 2026-05-13 · unverdicted · none · ref 42
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
cs.CL · 2025-10-04 · unverdicted · none · ref 2
Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
cs.LG · 2025-09-17 · unverdicted · none · ref 25
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.

Process Reinforcement through Implicit Rewards
cs.LG · 2025-02-03 · conditional · none · ref 106
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Robots Need More than VLA and World Models
cs.RO · 2026-06-04 · unverdicted · none · ref 133
The paper identifies four missing interfaces (data autolabelling, embodiment retargeting, physics-grounded world models, and video-based reward inference) as the central bottleneck beyond VLA scaling for robot intelligence.

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing
cs.LG · 2026-05-30 · unverdicted · none · ref 70
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.

torchtune: PyTorch native post-training library
cs.LG · 2026-05-20 · unverdicted · none · ref 38
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.

High-quality generation of dynamic game content via small language models: A proof of concept
cs.AI · 2026-01-30 · conditional · none · ref 31
Proof-of-concept shows fine-tuned small language models achieve adequate quality for real-time game content generation in a scoped RPG loop via retry-until-success and LLM-as-judge evaluation.

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
cs.CL · 2024-08-28 · unverdicted · none · ref 38
WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.

TALAS: Teacher-Anchored Layer Alignment with Adaptive Sharpness-Aware Minimization for Embedding Distillation
cs.CL · 2026-06-20 · unverdicted · none · ref 68
TALAS is a knowledge distillation method that selectively aligns upper student layers to teacher sentence embeddings, propagates knowledge top-down via relational constraints in lower layers, and uses ASAM to seek flatter minima.

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants
cs.CL · 2026-06-11 · unverdicted · none · ref 51
SkillChain automates skill lifecycle for e-commerce image AI assistants via creator, optimizer, and refiner stages, leading to improved response quality and user engagement in production A/B tests.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
cs.LG · 2026-05-07 · unverdicted · none · ref 83 · 2 links
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

Less LLM, More Documents: Searching for Improved RAG
cs.IR · 2025-10-03 · unverdicted · none · ref 33
Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
cs.CL · 2025-04-02 · unverdicted · none · ref 21
A new open SFT dataset for reasoning distillation lets coding models hit state-of-the-art scores on LiveCodeBench and CodeContests with supervised fine-tuning alone, outperforming RL-trained baselines.

A Survey on LLM-as-a-Judge
cs.CL · 2024-11-23 · unverdicted · none · ref 165
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
cs.CL · 2026-05-05 · unreviewed · ref 40

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
cs.CL · 2026-05-02 · unreviewed · ref 67

SRA: Span Representation Alignment for Large Language Model Distillation
cs.CL · 2026-05-02 · unreviewed · ref 67
