{"total":16,"items":[{"citing_arxiv_id":"2605.11679","ref_index":60,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion","primary_cat":"cs.AI","submitted_at":"2026-05-12T07:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08632","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding","primary_cat":"cs.CL","submitted_at":"2026-05-09T02:50:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05851","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hypothesis generation and updating in large language models","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:24:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05365","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ZAYA1-8B Technical Report","primary_cat":"cs.AI","submitted_at":"2026-05-06T18:44:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04913","ref_index":27,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training","primary_cat":"cs.CL","submitted_at":"2026-05-06T13:41:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27861","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive 
Learning","primary_cat":"cs.CR","submitted_at":"2026-04-30T13:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06854","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-08T09:17:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Domain-adapted clinical LLMs provide only marginal and unstable gains over general models on English clinical MCQA benchmarks, while new Spanish Marmoka models perform better.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.18425","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.05299","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SmolVLM: Redefining small and efficient multimodal models","primary_cat":"cs.AI","submitted_at":"2025-04-07T17:58:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.09992","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Diffusion Models","primary_cat":"cs.CL","submitted_at":"2025-02-14T08:23:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05171","ref_index":171,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach","primary_cat":"cs.LG","submitted_at":"2025-02-07T18:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A recurrent-depth architecture enables language models to improve reasoning performance by iterating 
computation in latent space, achieving gains equivalent to much larger models on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.02737","ref_index":242,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model","primary_cat":"cs.CL","submitted_at":"2025-02-04T21:43:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.13106","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-22T18:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Document DocVQA [40], Docmatix [41] 1.31M Chart/Figure ChartQA [42], MMC_Instruction [83], DVQA [84], LRV_Instruction [85], Chart- Gemma [86], InfoVQA [87], PlotQA [88] 1.00M OCR MultiUI [89], in-house data 0.83M Grounding RefCoco [90], VCR [91], in-house data 0.50M Multi-Image Demon-Full [92], Contrastive_Caption [93] 0.41M Text-only Magpie [94], Magpie-Pro [94], Synthia [95], Infinity-Instruct-subjective [82], Numina- Math [96] 2.21M Video & Text Data General LLaVA-Video-178K [25], ShareGPT4o-Video [28], FineVideo [97], CinePile [98], ShareGemini-k400 [99], ShareGemini-WebVID [99], VCG-Human [22], VCG-Plus [22], VideoLLaMA2 in-house data, Temporal Grounding in-house data 2.92M In this stage, we perform instruction tuning with instruction-following data to refine the model's ability to"},{"citing_arxiv_id":"2412.10302","ref_index":99,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding","primary_cat":"cs.CV","submitted_at":"2024-12-13T17:37:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05271","ref_index":267,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and 
test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"on these hallucination evaluation benchmarks, some hallucinations are still inevitably present when generating long responses in practical use. This is a challenge we plan to tackle in future work. 21 Model Name RefCOCO RefCOCO+ RefCOCOg avg.val test-A test-B val test-A test-B val test Grounding-DINO-L [153] 90.6 93.2 88.2 82.8 89.0 75.9 86.1 87.0 86.6 UNINEXT-H [267] 92.6 94.3 91.5 85.2 89.6 79.8 88.7 89.4 88.9 ONE-PEACE [247] 92.6 94.2 89.3 88.8 92.2 83.2 89.2 89.3 89.8 Shikra-7B [27] 87.0 90.6 80.2 81.6 87.4 72.1 82.3 82.2 82.9 Ferret-v2-13B [297] 92.6 95.0 88.9 87.4 92.1 81.4 89.4 90.0 89.6 CogVLM-Grounding-17B [248] 92.8 94.8 89.0 88.7 92.9 83.4 89.8 90.8 90.3 MM1.5 [296] - 92.5 86.7 - 88.7 77.8 - 87.1 - Qwen2-VL-7B [246] 91."},{"citing_arxiv_id":"2408.03326","ref_index":144,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-OneVision: Easy Visual Task Transfer","primary_cat":"cs.CV","submitted_at":"2024-08-06T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777-9786, June 2021. 11, 38, 40 [143] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024. 12 [144] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. ArXiv, abs/2406.08464, 2024. 36, 37, 39 [145] Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, and Lifu Huang. Vision-flan: Scaling human-labeled tasks in visual instruction tuning."}],"limit":50,"offset":0}
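The response above is a paged listing (total, items, limit, offset) where each item carries the citing paper's metadata plus the verdict, novelty_score, and optional citation-context fields. As a minimal sketch of how such a payload could be consumed, the Python below loads the response from a file and prints one summary line per citing paper, sorted by novelty_score. The filename "citations.json" is hypothetical; the field names are taken directly from the response shown above.

```python
import json

# Hypothetical path: assumes the JSON response above was saved to this file.
with open("citations.json") as f:
    resp = json.load(f)

# Paging summary, using the total/limit/offset fields of the response.
print(f"{resp['total']} citing papers returned "
      f"(limit={resp['limit']}, offset={resp['offset']})")

# One line per item, highest novelty_score first; top_context_role is null
# for items with no extracted citation context.
for item in sorted(resp["items"], key=lambda x: x["novelty_score"], reverse=True):
    role = item["top_context_role"] or "no context"
    print(f"[{item['novelty_score']:.0f}] {item['citing_arxiv_id']} "
          f"({item['primary_cat']}, {role}) {item['paper_title']}")
```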