hub Canonical reference

Gpt-4 technical report

OpenAI · 2023

Canonical reference. 71% of citing Pith papers cite this work as background.

51 Pith papers citing it

Background 71% of classified citations

browse 51 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 19 baseline 4 method 1

citation-polarity summary

background 17 baseline 4 support 1 unclear 1 use method 1

representative citing papers

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Mind2Web: Towards a Generalist Agent for the Web

cs.CL · 2023-06-09 · accept · novelty 8.0

Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

cs.CL · 2024-10-14 · conditional · novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Detecting Pretraining Data from Large Language Models

cs.CL · 2023-10-25 · conditional · novelty 7.0

Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

cs.CV · 2023-10-17 · accept · novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Ring Attention with Blockwise Transformers for Near-Infinite Context

cs.CL · 2023-10-03 · unverdicted · novelty 7.0

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

cs.LG · 2023-10-03 · conditional · novelty 7.0

Time-LLM reprograms frozen LLMs for time series forecasting via text prototypes and Prompt-as-Prefix, outperforming specialized models in standard, few-shot, and zero-shot settings.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

cs.CL · 2023-07-30 · unverdicted · novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

cs.CL · 2023-06-05 · unverdicted · novelty 7.0

RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.

LIMA: Less Is More for Alignment

cs.CL · 2023-05-18 · conditional · novelty 7.0

Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.

WizardLM: Empowering large pre-trained language models to follow complex instructions

cs.CL · 2023-04-24 · conditional · novelty 7.0

WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

cs.LG · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.

EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.

DeepSeek-OCR: Contexts Optical Compression

cs.CV · 2025-10-21 · unverdicted · novelty 6.0

DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

cs.CR · 2024-03-28 · accept · novelty 6.0

JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.

citing papers explorer

Showing 25 of 25 citing papers after filters.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 37
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Mind2Web: Towards a Generalist Agent for the Web cs.CL · 2023-06-09 · accept · none · ref 25
Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
TinyStories: How Small Can Language Models Be and Still Speak Coherent English? cs.CL · 2023-05-12 · conditional · none · ref 21
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cs.CL · 2024-10-14 · conditional · none · ref 36
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 184
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Detecting Pretraining Data from Large Language Models cs.CL · 2023-10-25 · conditional · none · ref 48
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
Ring Attention with Blockwise Transformers for Near-Infinite Context cs.CL · 2023-10-03 · unverdicted · none · ref 31
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension cs.CL · 2023-07-30 · unverdicted · none · ref 2
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems cs.CL · 2023-06-05 · unverdicted · none · ref 30
RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.
LIMA: Less Is More for Alignment cs.CL · 2023-05-18 · conditional · none · ref 45
Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
WizardLM: Empowering large pre-trained language models to follow complex instructions cs.CL · 2023-04-24 · conditional · none · ref 32
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 244
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Llemma: An Open Language Model For Mathematics cs.CL · 2023-10-16 · unverdicted · none · ref 164
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs cs.CL · 2023-10-03 · conditional · none · ref 92
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
Efficient Streaming Language Models with Attention Sinks cs.CL · 2023-09-29 · accept · none · ref 36
StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 29
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models cs.CL · 2023-08-03 · unverdicted · none · ref 86
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 cs.CL · 2023-06-05 · conditional · none · ref 32
A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models cs.CL · 2023-04-13 · accept · none · ref 67
AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face cs.CL · 2023-03-30 · unverdicted · none · ref 20
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
Language Models can Solve Computer Tasks cs.CL · 2023-03-30 · accept · none · ref 46
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong cs.CL · 2025-01-16 · unverdicted · none · ref 19
Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.
Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 22
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.
LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning cs.CL · 2023-08-07 · unverdicted · none · ref 43
LoRA-FA freezes LoRA's A matrix and trains only B with gradient corrections to approximate full fine-tuning gradients more closely.
Secrets of RLHF in Large Language Models Part I: PPO cs.CL · 2023-07-11 · unverdicted · none · ref 3
Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.

Gpt-4 technical report

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer