Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
21 Pith papers cite this work. Polarity classification is still indexing.
roles: background (3)
polarities: background (3)
citing papers explorer
- Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
- Locating and Editing Factual Associations in GPT
Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
- An Empirical Analysis of Factual Errors in Human-Written Text and its Application
An empirical study distills a taxonomy of human factual errors from newspaper corrections and shows LLMs achieve only 52% F1 on detection.
- TrustMargin: Training-Free Arbitration between Parametric Memory and Retrieved Evidence in Large Language Models
TrustMargin arbitrates between direct and RAG answers from a frozen LLM by combining a parametric-prior margin and an evidence-binding margin computed from model likelihoods, improving results on 2WikiMQA and CWQA.
- Eliciting associations between clinical variables from LLMs via comparison questions across populations
Indirect elicitation via triplet comparisons recovers meaningful association structures from LLMs and supports conservative causal candidate links across prompted subpopulations.
- Norm Anchors Make Model Edits Last
Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.
- OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA
Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.
- Cross-Lingual Exploration for Parametric Knowledge
Cross-lingual prompt exploration improves factual recall and consistency in LLMs across 17 languages more efficiently than native-language scaling.
- Who Owns the AI Recommendation? A Multi-Industry Empirical Map of Brand Category Ownership Across Large Language Models
An empirical study of LLM brand recommendations across industries finds moderate concentration (mean Gini 0.28) and low cross-model agreement (41.6%) on top brands.
- Multi-Hop Knowledge Composition is Bound by Pretraining Exposure
Controlled experiments show implicit multi-hop reasoning in LLMs requires prior exposure to compositional contexts during pretraining and does not transfer to unexposed individuals.
- Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models
Expert-aware causal tracing localizes factual recall to specific experts in some MoE models but requires coalitions in others, using CounterFact interventions on subject embeddings.
- Multi-component Causal Tracing in Large Language Models
A unified multi-component causal tracing method that uses soft interventions and a metric transformation to efficiently select critical LLM components for a target performance metric.
- Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions
LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.
- R$^3$AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
- LLM-Metrics: Measuring Research Impact Through Large Language Model Memory
LLM-Metrics probes memory in 17 LLMs across 549 CS papers from 2023-2024 and finds a modest Spearman correlation (rho=0.1495) with citation counts, stronger for 2024 papers.
- Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation
QREAM rewrites documents into a question-focused style using iterative ICL and distilled FT models, improving RAG performance by up to 8% (relative).
- Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality
KLCF formalizes long-form factuality as bidirectional distribution matching between expressed and parametric knowledge, using a sampled factual checklist for recall and a truthfulness reward for precision.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Fast & Faithful Function Vectors
LRP-based attention head selection and distributed application improve the efficiency and accuracy of function vectors for steering LLMs compared to prior choices.
- Offline Evaluation Measures of Fairness in Recommender Systems
The thesis identifies theoretical, empirical, and conceptual flaws in offline fairness measures for recommender systems and contributes new evaluation methods and practical guidelines.