{"total":20,"items":[{"citing_arxiv_id":"2605.05365","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ZAYA1-8B Technical Report","primary_cat":"cs.AI","submitted_at":"2026-05-06T18:44:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00364","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning","primary_cat":"cs.CL","submitted_at":"2026-05-01T02:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TokenUnlearn identifies critical tokens via masking and entropy signals then applies hard selection or soft weighting to unlearn only those tokens, yielding better forgetting and retained utility than sequence-level baselines on TOFU and WMDP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24819","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora","primary_cat":"cs.SE","submitted_at":"2026-04-27T14:05:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22989","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging","primary_cat":"cs.CV","submitted_at":"2026-04-24T20:03:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21926","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:59:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal 
stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19118","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-21T05:56:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18936","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fine-Tuning Small Reasoning Models for Quantum Field Theory","primary_cat":"cs.LG","submitted_at":"2026-04-21T00:21:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"training on \"textbook-quality\" synthetic data allows significantly smaller models to rival larger counterparts. This data-centric approach focuses on generating clear, instructional examples that make the reasoning process explicit. This paradigm extends to reasoning distillation, where student models learn from the reasoning traces of stronger teacher models via supervised fine-tuning (SFT). 
{"citing_arxiv_id":"2604.09952","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SLM Finetuning for Natural Language to Domain Specific Code Generation in Production","primary_cat":"cs.LG","submitted_at":"2026-04-10T23:11:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losing general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.11810","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization","primary_cat":"cs.DB","submitted_at":"2026-04-09T14:08:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.08121","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator","primary_cat":"cs.CV","submitted_at":"2026-04-09T11:41:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.02714","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-04-03T04:14:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2502.02737","ref_index":196,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model","primary_cat":"cs.CL","submitted_at":"2025-02-04T21:43:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2408.12528","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2024-08-22T16:32:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2408.00118","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gemma 2: Improving Open Language Models at a Practical Size","primary_cat":"cs.CL","submitted_at":"2024-07-31T19:13:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2406.09246","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OpenVLA: An Open-Source Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2024-06-13T15:46:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2406.00515","ref_index":153,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Large Language Models for Code Generation","primary_cat":"cs.CL","submitted_at":"2024-06-01T17:48:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2404.14219","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","primary_cat":"cs.CL","submitted_at":"2024-04-22T14:32:33+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2403.07974","ref_index":272,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","primary_cat":"cs.SE","submitted_at":"2024-03-12T17:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2402.06196","ref_index":211,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"are the so-called State Space Models (SSMs). While the notion of State Space Models has a long history in machine learning, it should be noted that in the context of language models, SSM is usually used in reference to the newer Structured State Space Model architecture, or S4 for short (see Gu et al. [29]). Some recent models in this category are Mamba [30], Hyena [210], and Striped Hyena [211]. While all of those models are very competitive in terms of performance in leaderboards and efficiency, they also address an important challenge in more traditional attention-based architectures: the lack of support for larger context windows. Having a good answer to many prompts requires context. For example, the response to \"Recommend some good movies"},
{"citing_arxiv_id":"2311.05232","ref_index":184,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","primary_cat":"cs.CL","submitted_at":"2023-11-09T09:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[195] took the first step towards evaluating the object hallucinations in the LVLMs. Evaluations and experiments reveal that current LVLMs are prone to generate inconsistent responses with respect to the associated image, including non-existent objects, wrong object types and attributes, incorrect semantic relationships, etc. [315, 361]. Furthermore, Liu et al. [185], Zong et al. [395] and Liu et al. [184] show that LVLMs can be easily fooled and experience a severe performance drop due to their over-reliance on the strong language prior, as well as their inferior ability to defend against inappropriate user inputs [112, 134]. Jiang et al. [138], Wang et al. [315] and Jing et al. [141] took a step forward to holistically evaluate multi-modal hallucination."}
],"limit":50,"offset":0}