{"total":26,"items":[{"citing_arxiv_id":"2606.22179","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Score Granularity Gap in Black-Box LLM Classification: A Comparative Study of Confidence Constructions","primary_cat":"cs.CL","submitted_at":"2026-06-20T18:20:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Comparative evaluation of seven confidence constructions across 25 LLM-dataset pairs reveals that verbalized scores provide good ranking but coarse granularity for thresholding, while multi-query aggregation helps weak models but can harm strong ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28631","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection","primary_cat":"cs.LG","submitted_at":"2026-05-27T15:38:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SHIFT selects compact RLVR training subsets using the magnitude of hidden-state change from a single inference rollout plus quality-weighted farthest-first coverage, outperforming training-free baselines on math reasoning and medical QA under low budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22047","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support","primary_cat":"cs.AI","submitted_at":"2026-05-21T06:34:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and 
supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21949","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-05-21T03:29:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0.92 on dev and 0.90 on test.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16679","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?","primary_cat":"cs.CL","submitted_at":"2026-05-15T22:34:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10025","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot 
Learning","primary_cat":"cs.CL","submitted_at":"2026-05-11T05:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05974","ref_index":46,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts","primary_cat":"cs.CR","submitted_at":"2026-05-07T10:19:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04759","ref_index":31,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gyan: An Explainable Neuro-Symbolic Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-06T11:06:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gyan is a novel explainable non-transformer language model that achieves SOTA results on multiple datasets by mimicking human-like compositional context and world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12384","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Preventing Safety Drift in Large Language Models via Coupled Weight and Activation 
Constraints","primary_cat":"cs.AI","submitted_at":"2026-04-14T07:17:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Then X_ℓ and C_ℓ = X_ℓ X_ℓ^⊤ have the same safety subspace, where the safety subspace is defined as the set of all vectors v such that v^⊤ X_ℓ = 0. Proof: We need to show that if v satisfies v^⊤ X_ℓ = 0, then v also satisfies v^⊤ C_ℓ = 0, and vice versa. Direction 1: If v^⊤ X_ℓ = 0, then v^⊤ C_ℓ = 0. Suppose v^⊤ X_ℓ = 0. It follows that: v^⊤ C_ℓ = v^⊤ (X_ℓ X_ℓ^⊤) = (v^⊤ X_ℓ) X_ℓ^⊤ = 0 · X_ℓ^⊤ = 0 (21) Therefore, v is also in the safety subspace defined by C_ℓ. Direction 2: If v^⊤ C_ℓ = 0, then v^⊤ X_ℓ = 0. Suppose v^⊤ C_ℓ = 0. Expanding this expression gives: v^⊤ (X_ℓ X_ℓ^⊤) = 0 (22) This can be rewritten as: (v^⊤ X_ℓ) X_ℓ^⊤ = 0 (23) Let w = v^⊤ X_ℓ. 
Then we have w X_ℓ^⊤ = 0, which implies: X_ℓ w^⊤ = 0 (24) Taking the squared norm: ∥X_ℓ w^⊤∥_2^2 = (X_ℓ w^⊤)^⊤ (X_ℓ w^⊤) = w X_ℓ^⊤ X_ℓ w^⊤ = 0 (25)"},{"citing_arxiv_id":"2604.10316","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Comparative Analysis of Large Language Models in Healthcare","primary_cat":"cs.CL","submitted_at":"2026-04-11T18:47:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Domain-specific models like ChatDoctor excel at medically accurate and contextually reliable text while general-purpose models like Grok and LLaMA perform better on structured medical question-answering tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06846","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors","primary_cat":"cs.CL","submitted_at":"2026-04-08T09:09:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05081","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MedGemma 1.5 Technical Report","primary_cat":"cs.AI","submitted_at":"2026-04-06T18:35:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MedGemma 1.5 4B reports absolute gains of 11% on 3D MRI classification, 3% on 3D CT, 47% macro F1 on pathology slides, 35% IoU on anatomical localization, and 5-22% on 
clinical QA tasks over MedGemma 1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09554","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LABBench2: An Improved Benchmark for AI Systems Performing Biology Research","primary_cat":"cs.AI","submitted_at":"2026-02-04T18:50:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08549","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering","primary_cat":"cs.IR","submitted_at":"2026-01-16T09:08:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03054","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation","primary_cat":"cs.CV","submitted_at":"2026-01-06T14:37:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and 
segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.17210","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting","primary_cat":"cs.CL","submitted_at":"2025-10-20T06:50:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention-Shifting uses importance-aware suppression on unlearning data and retention enhancement on retained data via dual-loss optimization to achieve selective unlearning with better utility preservation than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.00270","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks","primary_cat":"cs.LG","submitted_at":"2025-02-01T01:52:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergence to the optimal mixture.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.05465","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 
2026)","primary_cat":"cs.CL","submitted_at":"2025-01-03T19:53:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"While task agnostic SLMs offers a broad range of knowledge and reasoning capabilities, industry vertical-based SLMs excel in higher accuracy in specific contexts and efficient for industry-specific tasks. We discuss some of the vertical specific SLMs in this section. 2.3.1 Medical Domain: BioGPT[43] is an SLM in medical domain with fine-tuning on PubMedQA dataset[59], created using generative data augmentation technique, outperforms few-shot GPT-4. Efficient fine-tuning with Low-Rank Adaptation (LoRA) to capture the essential characteristics of the data and adapt to domain-specific tasks proved to be effective in creating BioGPT. 
Interestingly several SLM papers in the medical domain focused on augmenting data using LLMs [44, 159,"},{"citing_arxiv_id":"2412.18925","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","primary_cat":"cs.CL","submitted_at":"2024-12-25T15:12:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.13903","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment","primary_cat":"cs.CR","submitted_at":"2024-10-16T08:14:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CoreGuard introduces a computation- and communication-efficient protocol claimed to deliver upper-bound security against model stealing for edge-deployed LLMs with negligible overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.07887","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"An Empirical Study of Mamba-based Language Models","primary_cat":"cs.LG","submitted_at":"2024-06-12T05:25:15+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster 
inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.07960","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments","primary_cat":"cs.HC","submitted_at":"2024-05-13T17:38:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.03216","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation","primary_cat":"cs.CL","submitted_at":"2024-02-05T17:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02458","ref_index":130,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Data-Centric Foundation Models in Computational Healthcare: A Survey","primary_cat":"cs.LG","submitted_at":"2024-01-04T08:00:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys 
data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to process sequential inputs, leading to large pre-trained models for protein sequence tasks. ESM-2 [165] is a Transformer model with 15B parameters pre-trained on millions of protein sequences to predict protein structure directly from amino acid sequences, fast and accurately. ESM-2 illustrates the immense potential of LLMs to learn patterns in protein sequences across evolution. AlphaMissense [50] pre-trains an AlphaFold-like model [130] to predict protein structure via protein language modeling. It then fine-tunes the model with an additional variant pathogenicity classification objective on human and primate variant population frequency databases. AlphaMissense achieves state-of-the-art performance on missense variant pathogenicity prediction. 3.2.2 Model initialization. 
Healthcare FM pre-training benefits from utilizing general FMs as the"},{"citing_arxiv_id":"2303.13375","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Capabilities of GPT-4 on Medical Challenge Problems","primary_cat":"cs.CL","submitted_at":"2023-03-20T16:18:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.09085","ref_index":183,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Galactica: A Large Language Model for Science","primary_cat":"cs.CL","submitted_at":"2022-11-16T18:06:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}