Large language models can self-improve
7 Pith papers cite this work.
-
Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
-
AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
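A minimal sketch of the core idea, using predictive entropy as the acquisition function; the function names and probability values here are illustrative assumptions, not taken from the paper:

```python
import math

def entropy_acquisition(probs):
    """Shannon entropy of a learner's predictive distribution over labels.
    Higher entropy means the example is more uncertain, hence more
    informative for the learner."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reward_synthetic_batch(batch_probs):
    # Use each generated example's acquisition value as its RL reward,
    # steering the generator toward data the learner finds informative.
    return [entropy_acquisition(p) for p in batch_probs]

confident = [0.97, 0.01, 0.01, 0.01]   # learner already sure: low reward
uncertain = [0.25, 0.25, 0.25, 0.25]   # learner unsure: high reward
rewards = reward_synthetic_batch([confident, uncertain])
```

In this framing, any acquisition function from active learning (entropy, margin, expected gradient length) could be swapped in as the reward signal.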
-
Latent Preference Modeling for Cross-Session Personalized Tool Calling
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
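One plausible reading of the probability-clamping mechanism, sketched minimally (the floor/ceiling values and renormalization scheme are assumptions for illustration, not the paper's specification):

```python
def clamp_probs(probs, floor=0.0, ceiling=0.5):
    """Clamp each token probability into [floor, ceiling], then renormalize.
    Capping how much mass any single token can take keeps sampling diverse
    enough to stay coherent while the RL objective pushes the policy toward
    adversarial outputs."""
    clamped = [min(max(p, floor), ceiling) for p in probs]
    total = sum(clamped)
    return [p / total for p in clamped]

peaked = [0.90, 0.05, 0.03, 0.02]   # degenerate, near-deterministic head
smoothed = clamp_probs(peaked)      # top token's dominance is reduced
```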
-
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
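A toy example of what an executable structural causal model looks like; the rain/sprinkler graph and its probabilities are a standard illustration, not CauSim's actual generation pipeline:

```python
import random

def build_scm():
    """A tiny SCM: rain -> sprinkler -> wet grass. Each node is an
    executable function of its parents (plus exogenous noise), so causal
    queries and interventions are answered by running the model."""
    def rain(rng):
        return rng.random() < 0.3
    def sprinkler(r, rng):
        return (not r) and rng.random() < 0.5  # off when it rains
    def wet(r, s):
        return r or s
    return rain, sprinkler, wet

def estimate_p_wet(intervene_sprinkler=None, seed=0, n=20000):
    """Monte Carlo estimate of P(wet); passing intervene_sprinkler
    implements do(sprinkler = value) by overriding its mechanism."""
    rain, sprinkler, wet = build_scm()
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        r = rain(rng)
        s = sprinkler(r, rng) if intervene_sprinkler is None else intervene_sprinkler
        hits += wet(r, s)
    return hits / n

p_obs = estimate_p_wet()                          # observational P(wet)
p_do = estimate_p_wet(intervene_sprinkler=True)   # P(wet | do(sprinkler=on))
```

Because the model is executable, ground-truth answers to interventional questions come for free, which is what makes such simulators usable as scalable supervision.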
-
Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation
LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
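The contamination-avoidance principle reduces to filtering by release date; a minimal sketch with made-up problem IDs and dates (the real benchmark's schema and cutoffs differ):

```python
from datetime import date

def contamination_free(problems, cutoff):
    """Keep only problems released after the model's training cutoff,
    so no retained problem could have appeared in pretraining data."""
    return [p for p in problems if p["released"] > cutoff]

problems = [
    {"id": "two-sum",   "released": date(2022, 1, 10)},
    {"id": "grid-walk", "released": date(2024, 3, 5)},
]
# Evaluating a model trained through September 2023:
evalset = contamination_free(problems, cutoff=date(2023, 9, 30))
```

Continuously collecting fresh contest problems lets the cutoff slide forward per model, so each model is scored only on problems it provably never saw.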