{"total":79,"items":[{"citing_arxiv_id":"2605.23190","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement","primary_cat":"cs.CL","submitted_at":"2026-05-22T03:17:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reveals hidden human-like spans in machine-generated texts that raise detection complexity and proposes a stacked enhancement framework with hard-EM optimization to improve detectors across LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19394","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EmbGen: Teaching with Reassembled Corpora","primary_cat":"cs.CL","submitted_at":"2026-05-19T05:40:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[20] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. doi:10.48550/ ARXIV.1606.05250 [21] Arij Riabi, Thomas Scialom, Rachel Keraron, Benoît Sagot, Djamé Seddah, and Jacopo Staiano. 2021. Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering. arXiv:2010.12643 doi:10.48550/arXiv.2010.12643 [22] Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. BLEURT: Learning Robust Metrics for Text Generation. arXiv:2004.04696 doi:10.48550/arXiv.2004. 
04696 [23] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2024. The Curse of Recursion: Training on Generated Data Makes Models Forget. arXiv:2305.17493 doi:10."},{"citing_arxiv_id":"2605.19316","ref_index":75,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation","primary_cat":"cs.CL","submitted_at":"2026-05-19T03:52:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12340","ref_index":3,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Online Learning-to-Defer with Varying Experts","primary_cat":"stat.ML","submitted_at":"2026-05-12T16:19:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"input-label pair (x t, yt)T t=1 is sampled from one ofnlinearly separable clusters Ω y ={(x t, yt) :y t =y}. Online Learning-to-Defer with Varying Experts To introduce label noise, for eachx t ∈Ω y we flip the true label to a uniformly random alternative class in [n]\\{y} with probabilityp y,t. The noise vector at roundtisp t = (py,t)y∈[n]. We initializep 0 = [0.3,0.3,0.3,0.3,0.0,0.0] and evolve it via a random walk with Gaussian perturbations of mean 0 and standard deviationσ= 2×10 −3. 
At initializationt= 0, expertg 1 is knowledgeable on classes{1,2}and predicts thepost-noiselabels correctly on inputs from Ω 1 and Ω2, while predicting uniformly at random on other clusters. Expertg 2 is knowledgeable on{3,4}with the same behavior."},{"citing_arxiv_id":"2605.16360","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-05-09T13:18:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProxyKV offloads KV cache importance scoring to a lightweight intra-family small-model proxy with HybridAxialMapper and ranking-focused loss, matching KVZip accuracy while achieving up to 3.21x prefilling speedup on models up to 32B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08696","ref_index":71,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Structured Recurrent Mixers for Massively Parallelized Sequence Generation","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:07:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05974","ref_index":47,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable 
Prompts","primary_cat":"cs.CR","submitted_at":"2026-05-07T10:19:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05838","ref_index":94,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MDN: Parallelizing Stepwise Momentum for Delta Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03147","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls","primary_cat":"cs.CL","submitted_at":"2026-05-04T20:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03045","ref_index":154,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods 
Against Assumption Violations","primary_cat":"cs.LG","submitted_at":"2026-05-04T18:12:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00939","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity","primary_cat":"cs.LG","submitted_at":"2026-05-01T04:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"placement vector∆gis statistically likely to be orthogonal to the original gradient gclean, resulting in significant angu- lar deviation. Consequently, we define the EPGS score by explicitly coupling the gradient's magnitude with its direc- tional volatility: S=∥g clean∥2| {z } Local Geometry (Curvature Scale) ·(1−CosSim(g clean, gperturbed))| {z } Stochastic Instability (Directional Divergence) (7) where CosSim(u, v) = u·v max(∥u∥2,ϵ)·max(∥v∥2,ϵ) with a stabi- lization constantϵ= 1e −8 to prevent division by zero. This decomposition is critical for robustness. Themagni- tude termcaptures the local geometry (steepness) of the loss landscape, acting as a scaling factor that reflects the model's baseline \"struggle\" to maintain the prediction. 
The"},{"citing_arxiv_id":"2604.26587","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators","primary_cat":"cs.AR","submitted_at":"2026-04-29T12:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23750","ref_index":30,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation","primary_cat":"cs.LG","submitted_at":"2026-04-26T14:59:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We evaluate on Doc-to-LoRA [3] using three pretrained hypernetwork checkpoints covering back- bones from three different families: Gemma-2B-IT [10] (80K training steps), Qwen-4B-Instruct [1] (20K steps), and Mistral-7B-Instruct-v0.2 [19] (20K steps). All experiments use the official check- 8 points released by Sakana AI. For standard evaluation, we use the SQuAD [30] validation set (500 samples) through the official evaluation pipeline with flash attention disabled for compatibility. 
We use Gemma-2B as the primary evaluation model since its hypernetwork was trained for the longest (80K steps). Results on the two other backbones are reported in the cross-model subsection below. For KID-Bench evaluation we generate answers with greedy decoding up to 64 new tokens"},{"citing_arxiv_id":"2604.23647","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices","primary_cat":"cs.AR","submitted_at":"2026-04-26T10:34:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06683","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T20:37:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19342","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Are Large Language Models Economically Viable for Industry 
Deployment?","primary_cat":"cs.CL","submitted_at":"2026-04-21T11:25:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18349","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents","primary_cat":"cs.CL","submitted_at":"2026-04-20T14:44:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoMo10 benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17943","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents","primary_cat":"cs.CL","submitted_at":"2026-04-20T08:22:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12610","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of 
LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-14T11:36:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"reasoning, we conduct experiments along two complementary axes:Long-context and multi-hop QA benchmarks, and Hot- potQA Robustness Variants. a) Long-context and multi-hop QA benchmarks:We evaluate end-to-end answering performance on LongBench [49] and a set of widely used QA benchmarks, including Hot- potQA [50], 2WikiMultihopQA [51], MuSiQue [52], Natural Questions (NQ) [53], and SQuAD [54]. Collectively, these datasets span single-hop factual QA, compositional multi-hop reasoning, and reading comprehension with varying context lengths, enabling assessment of generalization across different reasoning depths and evidence aggregation patterns. 
6 b) HotpotQA Robustness Variants.:We evaluate robust- ness on four variants derived from HotpotQA [50] that cover"},{"citing_arxiv_id":"2604.10741","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation","primary_cat":"cs.CL","submitted_at":"2026-04-12T17:30:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09088","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation","primary_cat":"cs.CV","submitted_at":"2026-04-10T08:16:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InProceedings of the International Conference on Machine Learning, pages 8748-8763, 2021. 2, 3 [77] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1-67, 2020. 5, 3 [78] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 
Squad: 100,000+ questions for machine com- prehension of text.arXiv preprint arXiv:1606.05250, 2016. 1 [79] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fit- nets: Hints for thin deep nets. InProceedings of the Inter- national Conference on Learning Representations, 2015."},{"citing_arxiv_id":"2605.04058","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning","primary_cat":"cs.LG","submitted_at":"2026-04-10T08:00:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09048","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures","primary_cat":"cs.DC","submitted_at":"2026-04-10T07:15:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06794","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question 
Answering","primary_cat":"cs.CL","submitted_at":"2026-04-08T08:06:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GCoT-decoding combines Fibonacci sampling, heuristic backtracking, span-based confidence scoring, and semantic consensus aggregation to enable general chain-of-thought reasoning without task-specific prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06098","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"JU\\'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections","primary_cat":"cs.IR","submitted_at":"2026-04-07T17:10:54+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05483","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Can We Trust a Black-box LLM? 
LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-07T06:24:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.25412","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-03-26T13:08:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22869","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Chain-of-Authorization: Embedding authorization into large language models","primary_cat":"cs.AI","submitted_at":"2026-03-24T07:13:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed 
scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22755","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering","primary_cat":"cs.IR","submitted_at":"2026-03-04T01:30:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RADIANT-LLM is a local-first multi-modal RAG system with provenance tracking that delivers lower hallucination rates than general LLMs on nuclear engineering benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.17261","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-01-24T02:28:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AGZO restricts ZO perturbations to an activation-derived low-rank subspace, claiming higher gradient cosine similarity and better benchmark performance than isotropic ZO baselines on Qwen3 and Pangu models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.04720","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking","primary_cat":"cs.CL","submitted_at":"2026-01-08T08:36:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the 
MMEB-V2 multimodal embedding benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.06938","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation","primary_cat":"cs.CL","submitted_at":"2025-12-07T17:43:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Progress Ratio Embeddings use a trigonometric progress-ratio signal to deliver stable length control in transformers that generalizes to unseen target lengths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.09282","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering","primary_cat":"cs.SD","submitted_at":"2025-11-12T12:49:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.06516","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations","primary_cat":"cs.CL","submitted_at":"2025-11-09T19:58:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed 
precision in a training-free PTQ setting, outperforming task-agnostic baselines on accuracy-memory ratio across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24496","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM DNA: Tracing Model Evolution via Functional Representations","primary_cat":"cs.LG","submitted_at":"2025-09-29T09:09:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM DNA is introduced as a low-dimensional bi-Lipschitz functional representation proven to satisfy inheritance and genetic determinism, with a training-free extraction pipeline tested on 305 models to reveal relationships and construct phylogenetic trees.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.18629","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HyperAdapt: Simple High-Rank Adaptation","primary_cat":"cs.LG","submitted_at":"2025-09-23T04:29:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15229","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models","primary_cat":"cs.CL","submitted_at":"2025-08-21T04:32:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VocabTailor introduces a decoupled dynamic vocabulary selection framework 
that reduces vocabulary-related memory in SLMs by up to 99% with minimal task performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.14913","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation","primary_cat":"cs.CL","submitted_at":"2025-07-20T10:55:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.02259","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent","primary_cat":"cs.CL","submitted_at":"2025-07-03T03:11:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00435","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation","primary_cat":"cs.RO","submitted_at":"2025-07-01T05:33:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboEval is a new benchmark providing eight bimanual tasks, thousands of expert demonstrations, and standardized metrics for 
efficiency, coordination, safety, and failure localization in robotic manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.14067","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Online Conformal Abstention for Factuality Control Under Adversarial Bandit Feedback","primary_cat":"cs.LG","submitted_at":"2025-06-16T23:51:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExAUL converts any bandit algorithm's regret into an O(sqrt(T)) FDR bound for online conformal abstention under partial adversarial feedback via a conversion lemma and feedback unlocking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.04565","ref_index":142,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems","primary_cat":"cs.MA","submitted_at":"2025-06-05T02:34:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"retrieval pipelines with a single Transformer-based model that indexes and retrieves documents directly through its parameters [168]. Prompt Construction is a framework that focuses on designing and optimizing prompts to enable the generator to efficiently utilize retrieved information, thereby improving the relevance and accuracy of the final response. For example, Ren et al. 
[142] present new prompting strategies-priori judgement, which evaluates a question before answering, and posteriori judgement, which assesses the correctness of the answer to explore the impact of retrieval augmentation. Modular is a framework where RAG itself is designed as independent modules or applications, allowing for flexibility, easy replacement, and integration of different components."},{"citing_arxiv_id":"2503.22693","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging Language Models and Financial Analysis","primary_cat":"q-fin.ST","submitted_at":"2025-03-14T01:35:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.14249","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Humanity's Last Exam","primary_cat":"cs.LG","submitted_at":"2025-01-24T05:27:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Evaluating frontier models for dangerous capabilities, 2024. URL https://arxiv.org/abs/2403.13793. [43] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250. [44] P. Rajpurkar, R. Jia, and P. Liang. Know what you don't know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822. [45] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. 
Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022. [46] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge."},{"citing_arxiv_id":"2411.10915","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bias in Large Language Models: Origin, Evaluation, and Mitigation","primary_cat":"cs.CL","submitted_at":"2024-11-16T23:54:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature review that categorizes bias in LLMs, surveys evaluation and mitigation techniques, and discusses ethical implications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.15761","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees","primary_cat":"cs.CL","submitted_at":"2024-10-21T08:21:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.13903","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment","primary_cat":"cs.CR","submitted_at":"2024-10-16T08:14:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CoreGuard introduces a computation- and communication-efficient protocol claimed to deliver upper-bound security against model stealing for edge-deployed LLMs with negligible 
overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.17428","ref_index":143,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models","primary_cat":"cs.CL","submitted_at":"2024-05-27T17:59:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.14720","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Defending Against Indirect Prompt Injection Attacks With Spotlighting","primary_cat":"cs.CR","submitted_at":"2024-03-20T15:26:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.09227","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation","primary_cat":"cs.RO","submitted_at":"2024-03-14T09:48:36+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied 
AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}