{"total":15,"items":[{"citing_arxiv_id":"2605.30963","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AMix-2: Establishing Protein as a Native Modality in Large Language Models","primary_cat":"q-bio.BM","submitted_at":"2026-05-29T07:58:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AMix-2 unifies protein sequences and text in one LLM via shared tokens and block-wise diffusion modeling, introduces the ProteinArena benchmark, and reports competitive performance against task-specific protein models and frontier LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22211","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CLORE: Content-Level Optimization for Reasoning Efficiency","primary_cat":"cs.AI","submitted_at":"2026-05-21T09:16:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16462","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Asking Back: Interaction-Layer Antidistillation Watermarks","primary_cat":"cs.CR","submitted_at":"2026-05-15T08:28:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge 
queries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11317","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SOMA: Efficient Multi-turn LLM Serving via Small Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-11T23:07:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"2 Soft Prompt Tuning for Mining Weak Alignment Directions We learn P by minimizing a differentiable objective that makes G move away from F on the local dialogue context. The objective has three components: a token-level semantic divergence loss, an expectation-weighted distribution-level divergence term, and an anti-degeneration regularizer. Semantic divergence loss. We use an unlikelihood-style loss [49] to discourage G from assigning high probability to the tokens produced by F. A purely exact-token penalty is too narrow, since the surrogate could avoid the teacher token while still assigning high probability to close paraphrases. We therefore define a semantic neighborhood in the surrogate embedding space. 
Definition 3.1 (Semantic Neighborhood)."},{"citing_arxiv_id":"2605.10466","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition","primary_cat":"cs.LG","submitted_at":"2026-05-11T12:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09995","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Annotations Mitigate Post-Training Mode Collapse","primary_cat":"cs.CL","submitted_at":"2026-05-11T05:11:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07977","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback","primary_cat":"cs.LG","submitted_at":"2026-05-08T16:35:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on 
incorrect ones, outperforming baselines without ground-truth contexts.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"when built upon algorithms such as GRPO due to multiple generations required for group calculation [33]. Compared to prior work, SPEAR is crafted to operate in online, imperfect feedback settings. Additionally, SPEAR is capable of preventing the reinforcement of incorrect feedback-based outputs, which can lead to model collapse [36]. Unlikelihood Training: Unlikelihood training [41], in contrast with standard supervised fine-tuning (SFT), seeks to minimize the probability that certain tokens are generated by taking the complement of the maximization probability. Unlikelihood training has been shown to be effective in discouraging certain outputs over various use cases, such as overuse of frequent words and excessive repetition [23, 20]."},{"citing_arxiv_id":"2604.17323","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Universal Avoidance Method for Diverse Multi-branch Generation","primary_cat":"cs.CL","submitted_at":"2026-04-19T08:31:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAG is a universal avoidance generation method that increases multi-branch diversity in diffusion and transformer models by penalizing output similarity, delivering up to 1.9x higher diversity with 4.4x speed and 1/64th the FLOPs of prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.01237","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Differences Between Direct Alignment Algorithms are a Blur","primary_cat":"cs.LG","submitted_at":"2025-02-03T10:54:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 
controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.07199","ref_index":158,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents","primary_cat":"cs.AI","submitted_at":"2024-08-13T20:52:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.07691","ref_index":136,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ORPO: Monolithic Preference Optimization without Reference Model","primary_cat":"cs.CL","submitted_at":"2024-03-12T14:34:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.11411","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aligning Modalities in Vision Large Language Models via Preference 
Fine-tuning","primary_cat":"cs.LG","submitted_at":"2024-02-18T00:56:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.18290","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Direct Preference Optimization: Your Language Model is Secretly a Reward Model","primary_cat":"cs.LG","submitted_at":"2023-05-29T17:57:46+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019. [47] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4):229-256, may 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696. [48] Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'18/IAAI'18/EAAI'18. 
AAAI Press,"},{"citing_arxiv_id":"2009.01325","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to summarize from human feedback","primary_cat":"cs.CL","submitted_at":"2020-09-02T19:54:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1909.05858","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CTRL: A Conditional Transformer Language Model for Controllable Generation","primary_cat":"cs.CL","submitted_at":"2019-09-11T17:57:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}