{"total":39,"items":[{"citing_arxiv_id":"2606.29685","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CAREBench: A Child-Safety Risk Benchmark for Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-29T01:17:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAREBench is a new benchmark with 500 prompts in 12 risk categories that measures how often frontier LLMs fail to refuse or redirect child-safety risks, reporting failure rates between 2% and 58%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29672","ref_index":24,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-29T00:39:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multimodal LLMs achieve moderate correlations with human visual creativity ratings in zero-shot evaluation across two datasets, with reasoning outputs providing interpretability but no accuracy gain.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29630","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SFBench: The SciFy Scientific Feasibility Benchmark","primary_cat":"cs.AI","submitted_at":"2026-06-28T22:27:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SFBench provides 197 expert-created materials science claims with feasibility scores and explanations to evaluate AI systems on scientific feasibility 
assessment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28070","ref_index":89,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JD Oxygen AI Item Center (Oxygen AIIC) V1: An Industrial-Scale LLM/VLM-Centric Solution for Item Understanding, Management, and Applications","primary_cat":"cs.AI","submitted_at":"2026-06-26T13:33:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Oxygen AIIC is an industrial platform using LLMs and VLMs for scalable item knowledge production and service at JD.com, reporting 94.2% precision and 82.8% recall along with business metric improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22737","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation","primary_cat":"cs.AI","submitted_at":"2026-06-22T00:41:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GroundEval is a judge-free framework that generates questions from a domain config, records agent trajectories, and scores answers plus evidence paths on Silence, Perspective, and Counterfactual tracks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10315","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents","primary_cat":"cs.CL","submitted_at":"2026-06-09T02:11:01+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical study of a production multi-turn ordering agent finds 
LLM-as-judge recall below 25% for human-confirmed defects, missing cross-turn state issues due to limited rubric and routing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09697","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-08T16:19:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PsychoSafe is a psychologically-informed refusal framework that improves LLM refusal quality by 28.1% via prompting and fine-tuning on an 8019-pair corpus across five risk domains, with strong in-domain but limited out-of-domain results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09603","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion","primary_cat":"cs.CL","submitted_at":"2026-06-08T15:13:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Corpus-Grounded Feature Diffusion generates synthetic Traditional Chinese IEP training data from 25 seeds to fine-tune a 7B model that reaches BERTScore F1 0.779 on a 10-sample hold-out, beating several larger zero-shot models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03650","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy 
Benchmarks","primary_cat":"cs.CL","submitted_at":"2026-06-02T13:41:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01462","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-05-31T21:46:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01022","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProductWebGen: Benchmarking Multimodal Product Webpage Generation","primary_cat":"cs.CV","submitted_at":"2026-05-31T05:25:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT 
dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26734","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CIRCLED: A Multi-turn CIR Dataset with Consistent Dialogues across Domains","primary_cat":"cs.CV","submitted_at":"2026-05-26T09:11:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CIRCLED is a multi-turn CIR dataset of 22,608 sessions generated from existing single-turn datasets via CIReVL pipeline and curated with filters for consistency, scale, and generality across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21622","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-20T18:32:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20478","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables","primary_cat":"cs.CL","submitted_at":"2026-05-19T20:41:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stage-Audit raises source-frontier precision from 0.356 to 0.505 and F1 from 0.334 to 0.451 on a 51-instance cross-domain set by enforcing disjoint 
write rights and row-level source gates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19529","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment","primary_cat":"cs.AI","submitted_at":"2026-05-19T08:30:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Defines GEA validity criterion and reports first measurement of r=0.698 recovery with positive bias in LLM two-stage adaptive assessment, stronger for verifiable skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10850","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10379","ref_index":46,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness","primary_cat":"cs.CL","submitted_at":"2026-05-11T11:23:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProofRank benchmark shows substantial differences in LLM proof quality not captured by correctness, with trade-offs between quality metrics and 
accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=3GTtZFiajM. 13 [45] Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. CoRR, abs/2410.21819, 2024. doi: 10.48550/ARXIV.2410.21819. URL https://doi.org/10.48550/arXiv.2410.21819. [46] José Pombal, Ricardo Rei, and André F. T. Martins. Self-preference bias in rubric-based evaluation of large language models. CoRR, abs/2604.06996, 2026. doi: 10.48550/ARXIV.2604.06996. URL https://doi.org/10.48550/arXiv.2604.06996. [47] Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models."},{"citing_arxiv_id":"2605.06731","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents","primary_cat":"cs.CR","submitted_at":"2026-05-07T12:25:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the StateGuard defense.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"From assistant to double agent: Formalizing and benchmarking attacks on openclaw for personalized local ai agent. arXiv:2602.08412, 2026. [27] Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan, Tianyu Pang, Michael Qizhe Shieh, Fengze Liu, et al. Your agent, their asset: A real-world safety analysis of openclaw. arXiv:2604.04759, 2026. 
[28] Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. arXiv:2410.21819, 2024. [29] Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, and Yingqiang Ge. Clawsafety: \"Safe\" llms, unsafe agents. arXiv:2604.01438, 2026. [30] Zhenlin Xu, Xiaogang Zhu, Yu Yao, Minhui Xue, and Yiliao Song. From storage to steering:"},{"citing_arxiv_id":"2605.03858","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following","primary_cat":"cs.CL","submitted_at":"2026-05-05T15:20:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under perturbations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03472","ref_index":26,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks","primary_cat":"cs.CL","submitted_at":"2026-05-05T07:56:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a clean matched benchmark and Dynamic Emotional Signature Graphs (DESG) framework that detects implicit sycophancy via clinical-state transitions and reports a 0.0488 macro-F1 gain over baselines on harmful-risk detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03147","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Effective 
Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls","primary_cat":"cs.CL","submitted_at":"2026-05-04T20:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27727","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding","primary_cat":"cs.SE","submitted_at":"2026-04-30T11:20:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26243","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall","primary_cat":"cs.CL","submitted_at":"2026-04-29T02:55:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StratMem-Bench reveals that state-of-the-art LLMs distinguish required from irrelevant memories effectively but struggle to integrate supportive memories in character 
conversations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24700","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Green Shielding: A User-Centric Approach Towards Trustworthy AI","primary_cat":"cs.CL","submitted_at":"2026-04-27T17:04:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"100K [19], originally used to train medical LLMs, to the diagnosis setting. LLM as a Judge. As the need to evaluate LLM capabilities grows and human labeling remains costly, LLM-as-a-judge [52, 53] has become a widely used procedure for producing scalable evaluation signals for open-ended generation. While LLM judges can exhibit systematic artifacts, including self-preference bias [54, 55] and position bias [56], prior works have shown that, with careful prompt design, guardrails, and calibration, they can provide reliable and reproducible measurements in evaluation and benchmarking settings [57-60]. In the medical domain, recent works such as MedHELM [61] has also started to heavily rely on LLM judges during evalution. In our framework, we use multiple LLM judges primarily for (i) constructing structured reference diagnosis sets and (ii) matching clinical"},{"citing_arxiv_id":"2604.22517","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Aggregate vs. 
Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement","primary_cat":"cs.CL","submitted_at":"2026-04-24T12:56:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Personalized LLM judges conditioned on an individual evaluator's scoring history align more closely with that evaluator than aggregate judges trained on mixed histories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19598","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-21T15:51:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text duplication.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17658","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Self-Improving Error Diagnosis in Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-04-19T23:13:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic 
memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17197","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning to Control Summaries with Score Ranking","primary_cat":"cs.CL","submitted_at":"2026-04-19T01:58:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12994","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software","primary_cat":"cs.CR","submitted_at":"2026-04-14T17:26:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11287","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model","primary_cat":"cs.AI","submitted_at":"2026-04-13T10:50:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in 
quantitative details such as exercise intensity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07650","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles","primary_cat":"cs.AI","submitted_at":"2026-04-08T23:32:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06996","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Preference Bias in Rubric-Based Evaluation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-08T12:13:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05593","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge","primary_cat":"cs.AI","submitted_at":"2026-04-07T08:43:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Both humans and LLMs trust content more when labeled 
human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.01865","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation","primary_cat":"cs.CL","submitted_at":"2026-03-02T13:46:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.18027","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sentipolis: Emotion-Aware Agents for Social Simulations","primary_cat":"cs.AI","submitted_at":"2026-01-25T22:50:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sentipolis equips LLM agents with continuous PAD emotional states, dual-speed dynamics, and memory coupling to improve emotional continuity and grounded behavior in social simulations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.17230","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval","primary_cat":"cs.CL","submitted_at":"2026-01-23T23:41:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaseFacts benchmark of 6,294 claims shows LLMs struggle to verify colloquial legal statements against 
Supreme Court precedents, with unrestricted web search degrading performance due to noisy precedents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.01490","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning","primary_cat":"cs.CL","submitted_at":"2025-11-03T11:57:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Fine-tuning LLMs on multi-source synthetic data mitigates distribution collapse and self-preference bias while increasing output quality relative to single-source or human-only fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.07517","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning","primary_cat":"cs.AI","submitted_at":"2025-10-08T20:29:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the context of single-agent user interactions. Prior work has analyzed sycophantic tendencies, where models uncritically align with external views [46, 5, 6, 7, 8, 9, 10], and explored mitigation strategies [47, 48, 49, 50, 51]. In parallel, another body of work reports self-reliant behavior in LLMs-where models overly adhere to their own prior outputs [52, 53, 54, 55, 11, 12, 13]-with mitigation strategies also being investigated [14, 15]. 
However, discussions of identity bias in MAD remain scarce, with only a few works addressing sycophancy in this setup [56, 57]. In contrast, our work is, to the best of our knowledge, the first to unify these two phenomena under the broader notion of \"identity bias\", and to propose a method that eliminates it from"},{"citing_arxiv_id":"2509.26464","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Extreme Self-Preference in Language Models","primary_cat":"cs.AI","submitted_at":"2025-09-30T16:13:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Eight LLMs exhibited massive self-preference that followed assigned identities rather than true ones, appearing in both simple word tasks and consequential evaluations of job candidates and AI technologies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}