A Survey of Large Language Models

Junyi Li, Kun Zhou, Tianyi Tang, Wayne Xin Zhao, Xiaolei Wang, Yupeng Hou · 2023 · cs.CL · arXiv 2303.18223

Canonical reference. 85% of citing Pith papers cite this work as background.

309 Pith papers citing it

Background 85% of classified citations

abstract

Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvement, they have further studied the scaling effect by increasing the model size even further. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models. To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size. Recently, research on LLMs has been largely advanced by both academia and industry, and one remarkable advance is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. We also summarize the available resources for developing LLMs and discuss the remaining issues for future directions.

citation-role summary

background: 50 · method: 4 · dataset: 1

citation-polarity summary

background: 47 · use method: 4 · support: 2 · unclear: 1 · use dataset: 1

representative citing papers

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

cs.CV · 2026-05-11 · unverdicted · novelty 8.0 · 2 refs

Hilbert-Geo creates the first unified formal language for solid geometry and a two-step parsing-then-reasoning method that reaches SOTA accuracy on solid geometry benchmarks.

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

cs.LG · 2026-05-03 · unverdicted · novelty 8.0 · 2 refs

Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.

Diffusion-CAM: Faithful Visual Explanations for dMLLMs

cs.AI · 2026-04-13 · unverdicted · novelty 8.0

Diffusion-CAM is the first method for visual explanations in dMLLMs, using differentiable probing of intermediates plus four refinement modules to produce activation maps that outperform prior CAM approaches in localization and fidelity.

TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation

cs.CR · 2026-04-08 · unverdicted · novelty 8.0

TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

cs.AI · 2026-06-10 · unverdicted · novelty 7.0

SciAgentArena is a new interactive benchmark for AI agents on scientific tasks that finds agents handle clear data-analysis workflows but struggle with novel insights, self-directed exploration, and open-ended questions.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

cs.LG · 2026-06-08 · conditional · novelty 7.0

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.

A Taxonomy of Runtime Faults in Model Context Protocol Servers

cs.SE · 2026-06-03 · conditional · novelty 7.0

An empirical taxonomy of 11 top-level categories and 27 subcategories of runtime faults in MCP servers, derived via open coding of GitHub threads and validated by a survey of 55 developers.

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

cs.CR · 2026-05-22 · unverdicted · novelty 7.0

CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.

Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-symbolic classifier reaching 0.96 F1.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

Modality-Decoupled Online Recursive Editing

cs.LG · 2026-05-19 · conditional · novelty 7.0

M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

cs.SE · 2026-05-18 · conditional · novelty 7.0

Reversa is a reverse documentation engineering framework that deploys a multi-agent pipeline to extract implicit rules from legacy software and produce traceable specifications with confidence scores and explicit gaps for human review.

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

Rover: Context-aware Conflict Resolution with LLM

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

Rover uses a new Multi-layer Code Property Graph and clustering to supply LLMs with dependency-aware contexts, outperforming standalone LLMs, MergeGen, and WizardMerge on similarity to ground-truth conflict resolutions.

MLPs are Efficient Distilled Generative Recommenders

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

SID-MLP distills autoregressive generative recommenders into efficient position-specific MLP heads for Semantic ID tasks, achieving 8.74x faster inference with matching accuracy.

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

citing papers explorer

Showing 4 of 4 citing papers after filters.

NaiAD: Initiate Data-Driven Research for LLM Advertising

cs.LG · 2026-05-11 · unverdicted · none · ref 44 · internal anchor

NaiAD is a new dataset and framework for LLM-native advertising that uses decoupled generation and calibrated scoring to identify four semantic strategies for balancing user and commercial utilities.

InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking

cs.AI · 2026-04-03 · unverdicted · none · ref 1 · internal anchor

InfoSeeker is a new hierarchical parallel agent framework that delivers 3-5x speedups and benchmark gains on web search tasks by using context isolation and layered aggregation.

Agentic AI for Substance Use Education: Integrating Regulatory and Scientific Knowledge Sources

cs.CL · 2026-05-01 · conditional · none · ref 43 · internal anchor

The authors built and expert-evaluated an agentic AI system integrating DEA regulatory data with dynamic scientific literature via RAG to provide accurate, context-sensitive substance use education, with mean Likert ratings of 4.18-4.35 and substantial rater agreement.

From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems

cs.CR · 2026-04-07 · unverdicted · none · ref 58 · internal anchor

ASTRAL applies multimodal LLMs with prompt chaining and few-shot learning to synthesize CPS architectures from disparate sources, enabling adaptive threat identification and quantitative risk estimation, as supported by ablation studies and feedback from 14 cybersecurity practitioners.
