{"total":23,"items":[{"citing_arxiv_id":"2606.29808","ref_index":25,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework","primary_cat":"cs.HC","submitted_at":"2026-06-29T05:40:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25189","ref_index":60,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"ActPlane: Programmable OS-Level Policy Enforcement for Agent Harnesses","primary_cat":"cs.OS","submitted_at":"2026-06-23T21:33:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActPlane enforces agent-declared policies at OS level using IFC DSL and eBPF, improving compliance on indirect paths with 1.9-8.4% overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24115","ref_index":12,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy","primary_cat":"cs.CV","submitted_at":"2026-06-23T04:04:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"White-box method ReXTrust achieves highest AUC (peak 93.0) on Gut-VLM across five VLMs, outperforming alternatives by statistically significant margins while black-box and some gray-box methods collapse on certain models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26730","ref_index":9,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers","primary_cat":"cs.CL","submitted_at":"2026-05-26T09:06:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRISM benchmark finds LLMs match or exceed humans on isolated review dimensions like novelty verification but none achieve the balanced performance of human reviewers across depth, flaw prioritization, and constructiveness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24171","ref_index":13,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection","primary_cat":"cs.LG","submitted_at":"2026-05-22T19:44:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PromptAudit evaluates five prompting strategies across five LLMs on 1000 CVEs and finds chain-of-thought prompting yields the strongest overall performance while adaptive chain-of-thought and self-consistency reduce effective results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23180","ref_index":61,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Self-Improving In-Context Learning","primary_cat":"cs.CL","submitted_at":"2026-05-22T03:01:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A test-time zeroth-order optimization of prompt embeddings using a bounded self-supervised proxy from demonstration log-probabilities improves ICL accuracy and correlates with gains across tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20473","ref_index":39,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Code Generation by Differential Test Time Scaling","primary_cat":"cs.SE","submitted_at":"2026-05-19T20:39:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18302","ref_index":36,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience","primary_cat":"cs.HC","submitted_at":"2026-05-18T12:20:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GPT produces click distributions significantly different from real humans in 53% of UX first-click tasks, with prompting techniques like personas and chain-of-thought failing to improve alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15238","ref_index":3,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support","primary_cat":"cs.SE","submitted_at":"2026-05-14T03:18:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13052","ref_index":17,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"RAG-Enhanced Large Language Models for Dynamic Content Expiration Prediction in Web Search","primary_cat":"cs.IR","submitted_at":"2026-05-13T06:20:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"An LLM framework with RAG predicts query-specific validity horizons for web content expiration and shows gains in production A/B tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12370","ref_index":21,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Context Convergence Improves Answering Inferential Questions","primary_cat":"cs.CL","submitted_at":"2026-05-12T16:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Passages made from high-convergence sentences improve LLM performance on inferential questions compared to cosine similarity selection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the degree to which a hint filters out irrelevant or incorrect candi- dates, thereby steering the reasoning process toward the correct answer. A highly convergent hint provides precise guidance that sharply narrows the candidate set, while a low-convergence hint offers only vague or weak constraints. To operationalize this notion, we adopt a three-stage procedure inspired by prior work [ 21, 23]. First, an LLM generates up to 20 plausible candidate answers for the question, approximating the space of reasonable responses. Second, for each candidate, the model evaluates whether the hint applies to it, producing a binary judgment (Yes/No). This step determines which candidates remain consistent with the hint. Finally, the convergence score is computed as the proportion"},{"citing_arxiv_id":"2604.25618","ref_index":54,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding","primary_cat":"cs.MM","submitted_at":"2026-04-28T13:24:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CUCI-Net abstracts context-utterance dependency into an interpretation cue that combines local modality signals with global context and feeds it into the final multimodal interaction for context-conditioned predictions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"https://aclanthology.org/2025.coling- main.272/ [53] Qinfu Xu, Yiwei Wei, Chunlei Wu, Leiquan Wang, Shaozu Yuan, Jie Wu, Jing Lu, and Hengyang Zhou. 2025. Towards Multimodal Sentiment Analysis via Hierarchical Correlation Modeling with Semantic Distribution Constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 21788-21796. [54] Hongfei Xue, Linyan Xu, Yu Tong, Rui Li, Jiali Lin, and Dazhi Jiang. 2024. Break- through from Nuance and Inconsistency: Enhancing Multimodal Sarcasm Detec- tion with Context-Aware Self-Attention Fusion and Word Weight Calculation.. In Proceedings of the 2024 Joint International Conference on Computational Linguis- tics, Language Resources and Evaluation (LREC-COLING 2024)."},{"citing_arxiv_id":"2604.19520","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SimDiff: Depth Pruning via Similarity and Difference","primary_cat":"cs.AI","submitted_at":"2026-04-21T14:43:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16576","ref_index":9,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability","primary_cat":"cs.IR","submitted_at":"2026-04-17T13:02:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"BERT [14], enabled the modern dense retrieval paradigm, in which bi-encoders map queries and documents into a shared embedding space and compute relevance via dot product or cosine similarity. DPR [33] established the effectiveness of contrastive fine-tuning, and subsequent work improved the paradigm through stronger pre-training [29, 73], knowl- edge distillation [28, 40], and multilingual modeling [9]. With the rise of decoder-only LLMs-including LLaMA [ 67], Mistral [30], and Qwen [2]-researchers began adapting them into dense encoders through various pooling strate- gies, typically last-token pooling [47, 49]. A later generation of instruction-augmented retrievers, including Linq [12], Manuscript submitted to ACM 4 Yongkang Li, Panagiotis Eustratiadis, Yixing Fan, and Evangelos Kanoulas"},{"citing_arxiv_id":"2605.18765","ref_index":15,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"STAR: Semantic-Tuned and Tail-Adaptive Retriever for Graph-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-04-11T10:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"STAR is a semantic-tuned and tail-adaptive retriever for GraphRAG that uses cross-attention interaction learning and path-weighted contrastive learning to mitigate Semantic Shortcut Bias and Long-Tail Path Bias, reporting 1.8% retrieval and 2.2% QA gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05160","ref_index":3,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems","primary_cat":"cs.CY","submitted_at":"2026-04-06T20:47:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-agent generate-validate-revise framework reduces failures in realism and authenticity for LLM-personalized math problems, with one iteration helping and different strategies varying by criterion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.11689","ref_index":17,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks","primary_cat":"cs.AI","submitted_at":"2026-03-12T08:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces Explicit Logic Channel (ELC) with LLM, VFM and probabilistic inference for validating, selecting and enhancing MLLMs on zero-shot tasks using Consistency Rate and cross-channel integration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00991","ref_index":84,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Tracking Capabilities for Safer Agents","primary_cat":"cs.AI","submitted_at":"2026-03-01T08:39:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22123","ref_index":124,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Multilingual Vision-Language Models, A Survey","primary_cat":"cs.CL","submitted_at":"2025-09-26T09:46:13+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[8, 69, 142] that, despite predominantly English training data and fewer parameters, demonstrate strong performance across select languages. Training often incorporates parallel data, though performance degrades with decreasing digital resource availability-in extreme cases, models may struggle to distinguish grammatical from ungrammatical sentences [124]. 3.3 Image Representation Image representation in vision-language models relies on several architectures that extract visual features. Vision Transformers (ViT) [42] was invented as a competitive alternative to Convolutional Neural Networks (CNNs), with ViT models trained on large datasets like ImageNet-21k [118] at 224×224 resolution. ViT decomposes input images into"},{"citing_arxiv_id":"2509.07553","ref_index":77,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents","primary_cat":"cs.CL","submitted_at":"2025-09-09T09:46:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15736","ref_index":39,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research","primary_cat":"cs.CL","submitted_at":"2025-07-21T15:43:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":131,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Moreover, some studies have explored applying these benchmarks in real-world code development scenarios for automatic code generation and evaluation [243, 744]. • Commonsense Puzzle: Commonsense puzzle benchmarks, including LiveBench [ 850], BIG- Bench Hard [705] and ZebraLogic [450], assess models' ability to reason about commonsense situations. The ARC [131] and DRE-Bench [947] is often viewed as a challenging commonsense- based AGI test. JustLogic [87] further contributes to the evaluation of deductive reasoning and commonsense problem-solving. Moreover, Li et al. [382] introduce QuestBench, a benchmark designed to evaluate the ability of RLLMs to generate insightful and meaningful questions. The second focus area concerns Knowledge Benchmarks, essential for evaluating a model's capability"},{"citing_arxiv_id":"2501.16150","ref_index":48,"ref_count":1,"confidence":0.5,"is_internal_anchor":false,"paper_title":"A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions","primary_cat":"cs.AI","submitted_at":"2025-01-27T15:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In contrast, the personal computer domain, despite its significant Agents for Computer Use•6:9 Table 2. Our classification ofaction typesacross the different domains, along with relevant examples for each domain. Action types Web Android Personal computer Mouse/touch/keyboardMouse/touch/keyboard [59] Touch and keyboard [152] Mouse and keyboard [120] Direct UI accessHTML elements [48] Android elements [174] Custom [10], UI automation API [173] Task-tailored actionsFind on page [105] Go back [174] Switch application [9], send email [155] Executable codeJavaScript, Python [138], Selenium web driver [46] Android debug bridge [26] UI automation API [162], Bash [137] practical relevance in workplace automation and productivity applications, remains underexplored: Only10out"}],"limit":50,"offset":0}