A Survey of Large Language Models
Canonical reference. 83% of citing Pith papers cite this work as background.
Abstract
Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation over the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvement, they have further studied the scaling effect by increasing the model size to even larger scales. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show special abilities that are not present in small-scale language models. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size. Recently, research on LLMs has been largely advanced by both academia and industry, and a remarkable milestone is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community and is likely to revolutionize the way we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. We also summarize the available resources for developing LLMs and discuss remaining issues and future directions.
Citing papers
-
Diffusion-CAM: Faithful Visual Explanations for dMLLMs
Diffusion-CAM is the first method for visual explanations in dMLLMs, using differentiable probing of intermediates plus four refinement modules to produce activation maps that outperform prior CAM approaches in localization and fidelity.
-
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations
IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.
-
MLPs are Efficient Distilled Generative Recommenders
SID-MLP distills autoregressive generative recommenders into efficient position-specific MLP heads for Semantic ID tasks, achieving 8.74x faster inference with matching accuracy.
-
Variance-aware Reward Modeling with Anchor Guidance
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
NaiAD: Initiate Data-Driven Research for LLM Advertising
NaiAD is a new dataset and framework for LLM-native advertising that uses decoupled generation and calibrated scoring to identify four semantic strategies for balancing user and commercial utilities.
-
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions, but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.
-
LLMorphism: When humans come to see themselves as language models
LLMorphism is a proposed bias where exposure to human-like AI language leads people to view their own thinking as similar to statistical next-token prediction, risking under-attribution of mind to humans.
-
Anny-Fit: All-Age Human Mesh Recovery
Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth ordering, and shape accuracy while enabling distillation of semantic knowledge to
-
Revisiting the Travel Planning Capabilities of Large Language Models
LLMs extract explicit constraints effectively but struggle with implicit open-world requirements, structural biases in plans, and ineffective self-correction during travel planning.
-
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels (a toy 2PL fitting sketch appears at the end of this list).
-
ProMax: Exploring the Potential of LLM-derived Profiles with Distribution Shaping for Recommender Systems
ProMax uses dense retrieval and dual distribution reshaping on LLM-derived profiles to guide recommender models toward preferences for unseen items, substantially boosting base model performance on public datasets.
-
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
-
Participatory provenance as representational auditing for AI-mediated public consultation
Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
-
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
-
Self-Improving Tabular Language Models via Iterative Group Alignment
TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.
-
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
-
Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method
ConflictQA benchmark shows LLMs fail to resolve conflicts between text and KG evidence and often default to one source, motivating the XoT explanation-based reasoning method.
-
Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs
A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contamination, then applies it to create a more diverse lineage-aware dataset.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation
FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.
-
Large Language Models Align with the Human Brain during Creative Thinking
LLMs show scaling and training-dependent alignment with human brain responses in creativity-related networks during divergent thinking tasks, measured via RSA on fMRI data.
-
InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking
InfoSeeker is a new hierarchical parallel agent framework that delivers 3-5x speedups and benchmark gains on web search tasks by using context isolation and layered aggregation.
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while the impact diminishes for larger collectives.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains (a rough tokenization sketch appears at the end of this list).
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.
-
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.
-
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering
STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and long-term agent benchmarks.
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
-
Conditional Memory Enhanced Item Representation for Generative Recommendation
ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
-
Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic (a toy layer-freezing sketch appears at the end of this list).
-
PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives
PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
Evaluating the False Trust engendered by LLM Explanations
A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
-
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
-
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Event Fields: Learning Latent Event Structure for Waveform Foundation Models
Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on physiological tasks.
-
Mechanism Design for Quality-Preserving LLM Advertising
A quality-preserving auction framework for LLM advertising uses RAG-based endogenous reserves and KL-regularized or screened VCG mechanisms to achieve DSIC, IR, higher revenue, and better semantic fidelity than baselines.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language modeling.
-
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Causal analysis of LLMs finds standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain groups and Eastern models showing targeted regional sensitivities.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
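The item-response-theory entry above (Estimating LLM Grading Ability and Response Difficulty) fits IRT models to LLM grading outcomes. As a rough illustration of the underlying idea only, here is a minimal sketch of fitting a two-parameter logistic (2PL) model by gradient descent; the toy data, dimensions, and optimizer settings are assumptions for illustration, not the paper's actual setup.

```python
# Toy 2PL IRT fit: P(correct) = sigmoid(a_j * (theta_i - b_j)).
# Random data and hyperparameters are placeholders, not the paper's pipeline.
import torch

n_graders, n_responses = 17, 300
# responses[i, j] = 1 if grader i labeled response j correctly (random toy data)
responses = (torch.rand(n_graders, n_responses) > 0.5).float()

theta = torch.zeros(n_graders, requires_grad=True)      # grader ability
b = torch.zeros(n_responses, requires_grad=True)        # response difficulty
log_a = torch.zeros(n_responses, requires_grad=True)    # log discrimination (keeps a > 0)

opt = torch.optim.Adam([theta, b, log_a], lr=0.05)
for step in range(500):
    opt.zero_grad()
    p = torch.sigmoid(log_a.exp() * (theta[:, None] - b[None, :]))
    loss = torch.nn.functional.binary_cross_entropy(p, responses)
    loss.backward()
    opt.step()

print(theta.detach())  # higher theta = more robust grader under this toy fit
```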
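The Chronos entry above describes pretraining on tokenized time series. The sketch below shows one plausible way to turn a numeric series into discrete tokens via mean scaling and uniform binning; the bin count, clipping range, and function names are assumptions, not the released Chronos implementation.

```python
# Rough time-series tokenization sketch: scale by the series' mean absolute
# value, then quantize into a fixed vocabulary of bins.
import numpy as np

def tokenize_series(x, n_bins=4096, low=-15.0, high=15.0):
    scale = np.mean(np.abs(x)) + 1e-8           # mean-absolute scaling per series
    scaled = np.clip(x / scale, low, high)
    edges = np.linspace(low, high, n_bins - 1)  # uniform bin edges
    tokens = np.digitize(scaled, edges)         # integer ids in [0, n_bins - 1]
    return tokens, scale

def detokenize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    centers = np.linspace(low, high, n_bins)    # map ids back to representative values
    return centers[tokens] * scale

series = np.sin(np.linspace(0, 20, 128)) * 50 + 100
toks, s = tokenize_series(series)
recon = detokenize(toks, s)                     # approximate reconstruction of the input
```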
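The "Freeze Deep, Train Shallow" entry above proposes updating only the shallow layers during continued pre-training. The following is a minimal sketch of that allocation on a small Hugging Face checkpoint; the checkpoint choice, split index, and layer-name pattern are placeholders, and the paper's layer-sensitivity diagnostic is not reproduced.

```python
# Toy "freeze deep, train shallow" allocation: freeze the deeper transformer
# blocks and keep only the shallow blocks (plus embeddings and head) trainable.
import re
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in checkpoint
split = model.config.n_layer // 2                      # illustrative split point

for name, param in model.named_parameters():
    m = re.search(r"\.h\.(\d+)\.", name)               # GPT-2 blocks: transformer.h.<i>
    if m and int(m.group(1)) >= split:
        param.requires_grad = False                     # freeze the deep half

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)      # continue pre-training shallow layers
```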