{"total":18,"items":[{"citing_arxiv_id":"2606.01451","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Before and After Temperature: A Distributional View of Creative LLM Generation","primary_cat":"cs.CL","submitted_at":"2026-05-31T21:13:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 margin, and compression baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28571","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making","primary_cat":"cs.HC","submitted_at":"2026-05-27T14:56:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A between-subjects experiment (N=192) finds that token-level uncertainty increases agreement with LLM answers while relation-level uncertainty reduces external verification in medical decision tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28264","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy Distribution as a Fingerprint for Hallucinations in Generative Models","primary_cat":"cs.AI","submitted_at":"2026-05-27T10:12:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Token entropy distributions fingerprint hallucinations in generative models, enabling the Calibrated Entropy Score (CES) for single-pass black-box 
detection with calibration guarantees via a novel DKW inequality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20084","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-05-19T16:38:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BalanceRAG uses sequential graphical testing on a 2D lattice of threshold pairs to certify safe operating points that meet target risk levels in cascaded RAG while increasing coverage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19220","ref_index":140,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering","primary_cat":"cs.CL","submitted_at":"2026-05-19T00:47:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10893","ref_index":50,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded or Guessing? 
LVLM Confidence Estimation via Blind-Image Contrastive Ranking","primary_cat":"cs.CL","submitted_at":"2026-05-11T17:35:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Comparing Full to −L_rank isolates the contribution of the ranking loss in each bin (with L_brier held constant); the L_bce-only column shows the joint effect of removing both auxiliary losses. \"Gap\" = |Pred. − Act.|; lower is better. All values are percentages, with the lowest gap per row bolded. Columns: Bin | Full (BICR): Pred., Act., Gap | −L_rank: Pred., Act., Gap | L_bce only: Pred., Act., Gap. [0.0,0.1): 6.3, 23.3, 16.9 | 6.6, 20.6, 14.0 | 6.7, 21.8, 15.0; [0.1,0.2): 15.2, 31.5, 16.3 | 15.4, 32.5, 17.1 | 15.4, 33.9, 18.5; [0.2,0.3): 25.1, 38.4, 13.3 | 25.2, 40.8, 15.6 | 25.3, 42.8, 17.5; [0.3,0.4): 35.1, 45.2, 10.2 | 35.2, 46.3, 11.2 | 35.1, 47.7, 12.6; [0.4,0.5): 45.0, 50.4, 5.4 | 45.1, 49.0, 4.0 | 45.0, 50.1, 5.1; [0.5,0.6): 55.0, 56.0, 1.1 | 55.0, 52.2, 2.7 | 55.0, 52.9, 2.1; [0.6,0.7): 65.0, 63.4, 1.6 | 65.0, 56.7, 8.3 | 65.0, 57.8, 7.1; [0.7,0.8): 75.0, 72.0, 3.1 | 75.0, 64.3, 10.7 | 75.0, 65.2, 9.8; [0.8,0.9): 85.3, 83.6, 1.8 | 85.3, 77.4, 7.9 | 85.3, 77.9, 7.4; [0.9,1.0]: 95.9, 94.9, 1.0 | 95.9, 93.3, 2.7 | 96.0, 93.0, 3.0. [Figure 4 panels: Full (ECE: 0.056), no_brier (ECE: 0.070), no_rank (ECE: 0.075), bce_only (ECE: 0.080); axes: Predicted Confidence, Empirical Accuracy.] Figure 4: Per-seed reliability diagrams for each ablation variant. 
Each panel shows one reliability curve per seed (5 translucent red curves) and a grey histogram in the lower portion of the panel"},{"citing_arxiv_id":"2605.06201","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric","primary_cat":"cs.AI","submitted_at":"2026-05-07T13:09:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05777","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-07T07:09:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08149","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-03T18:43:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Feature rivalry in SAE representations strengthens with model uncertainty on high-entropy questions, enables output 
steering, and predicts answer correctness with AUROC 0.689 in Gemma-2-2B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19444","ref_index":129,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation","primary_cat":"cs.LG","submitted_at":"2026-04-21T13:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08977","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Testing the Assumptions of Active Learning for Translation Tasks with Few Samples","primary_cat":"cs.CL","submitted_at":"2026-04-10T05:30:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Informativeness and diversity of samples selected by active learning show no correlation with test performance on translation tasks using few samples; ordering and pre-training effects dominate instead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08974","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning","primary_cat":"cs.CL","submitted_at":"2026-04-10T05:27:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Supervised fine-tuning degrades the correlation between 
confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"ity, computed using input x and model output ŷ. Definition 3.2. We evaluate a confidence score by its task-level correlation, which measures how well the confidence score positively correlates with the outputs' quality. Correlation = ρ(Confidence(X, Ŷ), Quality(Ŷ)). We measure task-level correlation using the Spearman correlation ρ following Zablotskaia et al. (2023); Malinin and Gales (2021), which captures the notion that higher confidence should be associated with higher quality. Definition 3.3. We say a confidence score is mis-correlated if task-level correlation is low. We test probability and consistency based UQ metrics; we focus on these two groups of metrics as they can be used with any white-box model. Probability-Based (1) average token log proba-"},{"citing_arxiv_id":"2603.27098","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ensemble-Based Uncertainty Estimation for Code Correctness Estimation","primary_cat":"cs.SE","submitted_at":"2026-03-28T02:37:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.11689","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot 
Tasks","primary_cat":"cs.AI","submitted_at":"2026-03-12T08:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces Explicit Logic Channel (ELC) with LLM, VFM and probabilistic inference for validating, selecting and enhancing MLLMs on zero-shot tasks using Consistency Rate and cross-channel integration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.05110","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts","primary_cat":"cs.AI","submitted_at":"2026-01-08T16:58:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GlimpRouter uses the entropy of the first token in each reasoning step to decide whether to invoke a large model, yielding 10.7% higher accuracy and 25.9% lower latency than a standalone large model on AIME25.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26522","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy After </Think> for reasoning model early exiting","primary_cat":"cs.LG","submitted_at":"2025-09-30T16:59:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23108","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Artificial Phantasia: Emergent Mental Imagery in 
Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-09-27T04:36:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2302.09664","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation","primary_cat":"cs.CL","submitted_at":"2023-02-19T20:10:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}