DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Rationale-augmented ensembles in language models
13 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
representative citing papers
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Chain-of-thought reasoning in LLMs is often unfaithful: how much a model relies on its stated reasoning varies by task, and that reliance decreases as models grow larger.
A multi-agent LLM system discovers criteria such as Encouraging, Urgent, and Clear for surgical feedback and uses them to score 4.2k instances, outperforming prior content-based approaches in predicting trainee behavior changes and trainer approval.
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.
A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.
The paper identifies inadequately addressed challenges in optimizing task allocation, fostering robust reasoning through debates, managing layered context, enhancing memory, and applying multi-agent systems to blockchain.
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
citing papers explorer
- A Comprehensive Overview of Large Language Models