{"total":21,"items":[{"citing_arxiv_id":"2606.27981","ref_index":178,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ToxiREX: A Dataset on Toxic REasoning in ConteXt","primary_cat":"cs.CL","submitted_at":"2026-06-26T11:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27474","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks","primary_cat":"cs.SE","submitted_at":"2026-06-25T18:52:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecRef hybrid AR-diffusion decoding is tested on six benchmarks with three protocols, showing code benchmarks conflate structural and logical correctness, refinement can degrade correct tokens, and log-likelihood versus generative scoring produce inconsistent model rankings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26969","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Einstein World Models","primary_cat":"cs.AI","submitted_at":"2026-06-25T12:42:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond 
language.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26560","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention","primary_cat":"cs.CL","submitted_at":"2026-06-25T03:12:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EDA decouples erase and write addresses in delta-rule linear attention by adding a targeted erase step along a learned direction before the corrective write, yielding best results on 2.5B dense and 25B MoE models in pretraining and long-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13668","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution","primary_cat":"cs.CL","submitted_at":"2026-06-11T17:58:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Influcoder distills decoders' gradient influence rankings into an encoder for scalable influence-based data attribution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10722","ref_index":26,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-09T11:32:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Continual training recipe upcycles dense Qwen2.5-8B LLM to 4x channel-sparse model via predictor-gated bank-wise sparsity 
in SwiGLU FFN with a single-layer repair for long-context failure on RULER-CWE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09409","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings","primary_cat":"cs.AI","submitted_at":"2026-06-08T12:26:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Elo rankings from pairwise judgments correlate above 0.9 Spearman with accuracy rankings on five converted benchmarks, with minor style and bias effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08076","ref_index":83,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"\"I understand your perspective\": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory","primary_cat":"cs.CL","submitted_at":"2026-06-06T09:54:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs outperform humans in expressing illocutionary intents and sycophancy in successful persuasive counter-arguments from ChangeMyView, with crowd workers preferring LLM versions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04661","ref_index":218,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts","primary_cat":"cs.CL","submitted_at":"2026-06-03T09:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and 
cost-oriented generators plus NSGA-II retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03624","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-06-02T13:23:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRGC models instructions as constraint graphs, identifies bridge constraints, and cuts violations by 39% on three datasets while preserving reasoning performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01007","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference","primary_cat":"cs.LG","submitted_at":"2026-05-31T04:51:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07603","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution","primary_cat":"cs.LG","submitted_at":"2026-05-29T09:31:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MetaEvo is a two-stage framework using preference optimization for principle abstraction followed by modular 
reuse to enable continual improvement of LLM agents on reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25773","ref_index":73,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Efficient Benchmarking Is Just Feature Selection and Multiple Regression","primary_cat":"stat.ML","submitted_at":"2026-05-25T12:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Kernel ridge regression combined with mRMR feature selection improves prediction of full benchmark scores from question subsets over existing efficient benchmarking techniques.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23102","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLM Sparsity Prior for Robust Feature Selection","primary_cat":"stat.ML","submitted_at":"2026-05-21T23:34:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LSP adds hierarchical hyperpriors over global sparsity and weight concentration parameters so that spike-and-slab models can discount inaccurate LLM weights while retaining gains when the weights are good.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19597","ref_index":26,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening","primary_cat":"cs.CL","submitted_at":"2026-05-19T09:40:29+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows 
frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09100","ref_index":40,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression","primary_cat":"cs.CL","submitted_at":"2026-05-09T18:15:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The hyperparameters used in the HPA-based LLM serving system are presented in Table 14, Appendix B. Detailed actual inference time can be found in Table 15. 5 Related work Unifying generation and embedding. Text retrieval [38, 39] and generation are inherently complementary, which has motivated substantial research effort into improving their interplay [40], RAG being a prominent example [12], where information retrieved from external sources is leveraged to supplement or correct potential errors in the text generation process. 
However, these two tasks are typically handled by two distinct LLMs, trained with different objectives: retrieval models commonly employ bidirectional attention with contrastive learning [3, 31, 4], whereas generative models rely"},{"citing_arxiv_id":"2605.02028","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Language models fail at extended rule following","primary_cat":"cs.CL","submitted_at":"2026-05-03T19:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01429","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SCALE-LoRA: Auditing Post-Retrieval LoRA Composition with Residual Merging and View Reliability","primary_cat":"cs.AI","submitted_at":"2026-05-02T13:00:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCALE-LoRA proposes a post-retrieval audit framework using sparse residual composition and disagreement-based reliability signals to improve open-pool LoRA adapter reuse on tasks like BIG-Bench Hard.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03679","ref_index":40,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LightThinker++: From Reasoning Compression to Memory Management","primary_cat":"cs.CL","submitted_at":"2026-04-04T10:46:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to 
LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"internalizes explicit context engineering as a core component of its decision-making process. This high-density learning signal allows the model to maintain context hygiene and reasoning fidelity across extended interaction horizons. 6 Experiments: Long-Horizon Agentic Reasoning 6.1 Experimental Settings Dataset Construction and Filtering. The base query pool is curated from a diversified ensemble of sources, including HotpotQA [40], MuSiQue [41], WebDancer [42], WebShaper [43], and WebWalkerQA-Silver [44]. To ensure the necessity of multi-hop reasoning and high-order planning, we perform heuristic filtering on HotpotQA and MuSiQue by selecting only those instances where Qwen3-30B-A3B-Instruct-2507 fails to yield direct solutions. Regarding the WebWalkerQA-Silver corpus, we adopted a language-specific selection policy:"},{"citing_arxiv_id":"2601.02780","ref_index":45,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MiMo-V2-Flash Technical Report","primary_cat":"cs.CL","submitted_at":"2026-01-06T07:31:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.17297","ref_index":202,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"InternLM2 Technical 
Report","primary_cat":"cs.CL","submitted_at":"2024-03-26T00:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}