{"total":16,"items":[{"citing_arxiv_id":"2606.20676","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity","primary_cat":"cs.CV","submitted_at":"2026-06-12T15:53:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10307","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate","primary_cat":"cs.CL","submitted_at":"2026-06-09T01:52:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Early-token log-probabilities from LLM decoding are stronger predictors of reasoning quality than full-sequence statistics in multi-agent debate on essay scoring tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10296","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge","primary_cat":"cs.CL","submitted_at":"2026-06-09T01:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"In two-agent debate, log-probability confidence aligns with LLM-judged reasoning quality roughly twice as strongly for the Constructor (AUROC 0.804 for critical failure detection) as for the Auditor (0.634).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05384","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges","primary_cat":"cs.AI","submitted_at":"2026-06-03T19:37:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02289","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations","primary_cat":"cs.CL","submitted_at":"2026-06-01T14:11:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The DECK taxonomy partitions LLM hallucinations into four detectability regimes using consistency and confidence axes, mapping each to scorer families and identifying a universal blind spot for output-level uncertainty quantification on knowledge-gap inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20364","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation","primary_cat":"cs.CL","submitted_at":"2026-05-19T18:16:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10862","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems","primary_cat":"cs.CL","submitted_at":"2026-05-11T17:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08503","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"NARRA-Gym for Evaluating Interactive Narrative Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T21:36:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08326","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLM Advertisement based on Neuron Auctions","primary_cat":"cs.LG","submitted_at":"2026-05-08T16:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InProceedings of the ACM Web Conference 2026, pages 261-272, 2026. [2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in Neural Information Processing Systems (NeurIPS), 33:1877-1901, 2020. [3] Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607-15631, Toronto, Canada, July 2023. Association for Computational"},{"citing_arxiv_id":"2605.05403","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-06T19:36:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Can large language models be an alternative to hu- man evaluation? InProceedings of the 61st Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 15607-15631, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.870. URL https://aclanthology.org/2023.acl-long.870/. [12] Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems 31 (NeurIPS 2017), pages 4302-4310. Curran Associates, Inc., Decem- ber 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ d5e2c0adad503c91f91df240d0cd4e49-Paper."},{"citing_arxiv_id":"2604.20726","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization","primary_cat":"cs.CL","submitted_at":"2026-04-22T16:12:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18835","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring","primary_cat":"cs.CL","submitted_at":"2026-04-20T20:59:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06666","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM","primary_cat":"cs.CL","submitted_at":"2026-04-08T04:34:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"70 1.26 2.02 4.424.384.65 overlaps and fail to capture semantic accuracy and reasoning quality. Recently, Chen et al. [12] has demonstrated that GPT performs well in evaluating text quality from multiple perspectives, even without reference texts. Additionally, studies have shown that LLM-based evaluation closely aligns with expert human assessments [15, 26]. Therefore, we utilize GPT-3.5 to evaluate the quality of explanations based on four widely adopted human evaluation metrics [66, 74]:misleadingness, informativeness,soundness, andreadability. Each metric is rated on a 5-point Likert scale, where 1 represents the lowest score and 5 the highest, except for misleadingness. The definitions of these"},{"citing_arxiv_id":"2604.03127","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts","primary_cat":"cs.CL","submitted_at":"2026-04-03T15:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"candidate semantic boundary where the topic or pedagogical function shifts. To determine the boundary threshold τ, we leverage sparse ground truth annotations. We identify true boundaries as positions where consecutive labeled utterances have differing labels, and non-boundaries as positions where consecutive labeled utterances share the same label. We then sweep τ over [0.3, 1.0) in increments of 0.01 and select the threshold that maximizes F1 on this boundary classification problem. We additionally impose a safety cap whereτ cannot exceed the median of all observed similarity scores, preventing over-segmentation when fine-tuned embeddings exhibit representation collapse. Chunks are created by splitting at positions where sim(i)<τ , respecting a minimum size of 2 and maximum of 20 utterances. 3.3 Domain-Adapted Embeddings To improve semantic separation of tutoring dialogue labels in the embedding space, we fine-tune BGE-large-en-v1.5 on labeled utterances from the TalkMoves and Eedi training sets combined. We combine both datasets because they share the same label taxonomy, and pooling increases the diversity of in-batch negatives for MNRL while improving coverage of rare labels that are underrepresented in either dataset alone. Training on both classroom and dyadic chat data also encourages the embedding space to capture pedagogical function across interaction formats rather than surface patterns specific to one setting. To prevent data leakage through the embedding geometry, we use only training-split utterances for fine- tuning. Training data consists of (anchor, positive) pairs sampled within each label group, capped at 3,000 pairs per label to prevent frequent labels from dominating the gradient. We optimize with MNRL (Henderson et al., 2017), where all other in-batch examples serve as implicit negatives. For a batch ofBpairs, the loss for pairiis Li =−log exp(sim(ai,p i)/T) ∑B j=1 exp(sim(ai,p j)/T) (2) where sim is cosine similarity and T is a temp"},{"citing_arxiv_id":"2512.23578","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models","primary_cat":"cs.CL","submitted_at":"2025-12-29T16:23:54+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.21582","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents","primary_cat":"cs.CL","submitted_at":"2025-06-17T05:24:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VIDEE introduces a human-in-the-loop system using Monte-Carlo Tree Search for task decomposition, executable pipeline generation, and LLM-based evaluation with visualizations to support non-expert text analytics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}