{"total":19,"items":[{"citing_arxiv_id":"2606.17905","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions","primary_cat":"cs.CL","submitted_at":"2026-06-16T13:28:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChLogic benchmark shows persistent English-Chinese gaps in LLM logical reasoning performance, with back-translation effects varying by model and difficulty.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30087","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison","primary_cat":"cs.AI","submitted_at":"2026-05-28T15:33:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark with 34,560 instances for selective QA over conflicting multi-source personal memory and compares fusion methods against LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24981","ref_index":116,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model Selection with Limited Annotations","primary_cat":"cs.CL","submitted_at":"2026-05-24T10:18:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18565","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-18T15:43:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MINTEval benchmark shows current memory-augmented systems average 27.9% accuracy on long-horizon interference tasks, limited by retrieval and memory construction with degradation from intervening updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18380","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi","primary_cat":"cs.AI","submitted_at":"2026-05-18T13:26:14+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fail to achieve consistent correctness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17625","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents","primary_cat":"cs.AI","submitted_at":"2026-05-17T19:44:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer tokens than full-context baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08966","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VORT: Adaptive Power-Law Memory for NLP Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-09T14:20:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Processing Systems, volume 30, pages 5998-6008, 2017. [44] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020. [45] J. Welbl, P. Stenetorp, and S. Riedel. Constructing datasets for multi-hop reading comprehension across documents.Transactions of the ACL, 6:287-302, 2018. [46] J. Weston et al. Towards AI-complete question answering: A set of prerequisite toy tasks.arXiv preprint arXiv:1502.05698, 2015. [47] N. Wiener.Extrapolation, Interpolation and Smoothing of Stationary Time Series. MIT Press, Cambridge, MA, 1949. [48] S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated linear attention transformers with hardware-efficient training."},{"citing_arxiv_id":"2605.05741","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory","primary_cat":"cs.AI","submitted_at":"2026-05-07T06:32:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00817","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-01T17:55:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15009","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching","primary_cat":"cs.AI","submitted_at":"2026-04-16T13:36:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11575","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts","primary_cat":"cs.CL","submitted_at":"2026-04-13T14:53:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16264","ref_index":130,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Data-Constrained Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-25T17:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, et al. 2022. BLOOM+ 1: Adding Language Support to BLOOM for Zero-Shot Prompting. arXiv preprint arXiv:2212.09535. [129] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 18 [130] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An Open Bilingual Pre-trained Model. arXiv preprint arXiv:2210.02414. [131] Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, et al. 2021. PanGu-alpha: Large-scale Autoregres-"},{"citing_arxiv_id":"2201.02177","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets","primary_cat":"cs.LG","submitted_at":"2022-01-06T18:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.04286","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UW-BHI at MEDIQA 2019: An Analysis of Representation Methods for Medical Natural Language Inference","primary_cat":"cs.IR","submitted_at":"2019-07-09T16:47:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Compares BERT, ESP, and Cui2Vec embeddings within ESIM on the MedNLI shared-task dataset to assess performance and internal representations for medical inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.08942","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Be Consistent! Improving Procedural Text Comprehension using Label Consistency","primary_cat":"cs.CL","submitted_at":"2019-06-21T04:29:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A label consistency training framework improves F1 on the ProPara benchmark for procedural text comprehension by using multiple independent descriptions of the same process.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.08570","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hindi Question Generation Using Dependency Structures","primary_cat":"cs.CL","submitted_at":"2019-06-20T12:05:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A rule-based system using karaka-dependency structures and IndoWordNet generates significantly more diverse Hindi questions than input sentences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1807.03819","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Universal Transformers","primary_cat":"cs.CL","submitted_at":"2018-07-10T18:39:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Published as a conference paper at ICLR 2019 UNIVERSAL TRANSFORMERS Mostafa Dehghani∗† Stephan Gouws∗ Oriol Vinyals University of Amsterdam DeepMind DeepMind dehghani@uva.nl sgouws@google.com vinyals@google.com Jakob Uszkoreit Łukasz Kaiser Google Brain Google Brain usz@google.com lukaszkaiser@google.com ABSTRACT Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence"},{"citing_arxiv_id":"1611.09268","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MS MARCO: A Human Generated MAchine Reading COmprehension Dataset","primary_cat":"cs.CL","submitted_at":"2016-11-28T18:14:11+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1606.06565","ref_index":164,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Concrete Problems in AI Safety","primary_cat":"cs.AI","submitted_at":"2016-06-21T13:37:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For the former challenge, there has been some recent promising work on pinpointing aspects of a structure that a model is uncertain about [162, 81], as well as obtaining calibration in structured output settings [83], but we believe there is much work yet to be done. For the latter challenge, there is also relevant work based on reachability analysis [93, 100] and robust policy improvement [164], which provide potential methods for deploying conservative policies in situations of uncertainty; to our knowledge, this work has not yet been combined with methods for detecting out-of-distribution failures of a model. Beyond the structured output setting, for agents that can act in an environment (such as RL agents), 18 information about the reliability of percepts in uncertain situations seems to have great potential"}],"limit":50,"offset":0}