CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution. It evaluates published attacks on Pythia and OLMo models and releases an open-source library.
Mixed citations
Proceedings of the 2018
Mixed citation behavior. Most common role is background (67%).
representative citing papers
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.
A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.
EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of initialization seed on two model architectures.
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Pruning attention layers in five LLMs across eight datasets maintains accuracy but degrades faithfulness and calibration.
A post-hoc detection framework exploits generation-induced patterns in autoregressive image outputs to enable provenance tracing across multiple IAR models without altering the generation process.
SCOPE selects readable hidden layers, constructs conformal gates with IND calibration, and uses supermartingale e-processes to certify persistent service-boundary evidence, improving rejection over final-layer detectors across multiple LLMs and boundary conditions.
LLMs given only research questions from 1000 arXiv CS papers recommend a narrower set of methods than the original papers, with effective model-entity diversity dropping from 1232 to 59-96 and stronger agreement among LLMs than with papers.
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
Empirical benchmarks show that distribution similarity between adaptation and pretraining data increases practical privacy leakage in DP-adapted LLMs at fixed theoretical guarantees, with LoRA providing the strongest protection for OOD cases.
Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and the approach improves linguistic probing performance on 49 of 65 phenomena.
SynLearner lets LLMs improve synthetic data generation on later tasks in a stream by learning reusable patterns and balancing quality with diversity from feedback on earlier tasks.
GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
MC² corrects low-budget Monte Carlo solutions for elliptic PDEs with a single-pass neural network to match the accuracy of 1000× more Monte Carlo samples while outperforming classical and learned baselines.
Extreme Weather Bench supplies standardized case studies, observational data, impact metrics, and code to evaluate weather models on high-impact hazards.
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.
citing papers explorer
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
- MC$^2$: Monte Carlo Correction for Fast Elliptic PDE Solving
MC² corrects low-budget Monte Carlo solutions for elliptic PDEs with a single-pass neural network to match the accuracy of 1000× more Monte Carlo samples while outperforming classical and learned baselines.
- Extreme Weather Bench: A framework and benchmark for evaluation of high-impact weather
Extreme Weather Bench supplies standardized case studies, observational data, impact metrics, and code to evaluate weather models on high-impact hazards.
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.