{"total":34,"items":[{"citing_arxiv_id":"2606.21517","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MedHal-Loc: Are \"Explainable-by-Architecture\" Medical Hallucination Detectors Faithful Localizers? A Localization Benchmark","primary_cat":"cs.CL","submitted_at":"2026-06-19T15:11:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MedHal-Loc benchmark shows KG-triple hallucination detectors localize errors no better than chance on controlled medical statements due to entity extraction limits, while NLI and consistency methods succeed above chance, and real hallucinations are mostly diffuse conclusion changes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19396","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"BioHarness: Substrate-Aware Evidence Assembly for Biomedical Question Answering across Literature, Knowledge Bases, and Biological Atlases","primary_cat":"q-bio.QM","submitted_at":"2026-06-17T06:25:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BioHarness improves pooled biomedical QA score from 65.9 to 71.0 on 19,302 items by using staged, substrate-aware evidence assembly that escalates only when needed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13104","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-11T09:33:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AuthorityBench shows 
citation presence (real or fabricated) increases LLM hallucination rates vs no-citation baseline, strongest for fabricated citations on true claims, with domain variation but negligible venue or author effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11740","ref_index":192,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA","primary_cat":"cs.CV","submitted_at":"2026-06-10T07:16:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"UniReason-Med introduces a unified framework for 2D and 3D medical VQA with shared grounded reasoning, trained on a 220K dataset, claiming that joint 2D+3D supervision improves 3D performance over 3D-only training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11675","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning","primary_cat":"cs.AI","submitted_at":"2026-06-10T05:39:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces the first structured pulmonary knowledge graph LungKG and uses it to train Lung-R1, which reaches SOTA on EMR-based pulmonary diagnosis tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20663","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DrugBench: Evaluating AI Control Protocols for Medication Harm 
Mitigation","primary_cat":"cs.AI","submitted_at":"2026-06-10T01:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DrugBench evaluates AI control protocols on 3,671 medical conversations for four medication harm types and finds existing protocols subvertible, proposing severity-based monitoring instead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11337","ref_index":58,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Can AI Agents Synthesize Scientific Conclusions?","primary_cat":"cs.AI","submitted_at":"2026-06-09T18:16:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11262","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry","primary_cat":"cs.LG","submitted_at":"2026-06-09T02:52:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DoRA-RBAC experiments on LLaMA-3.1-8B and Mistral-7B across QA benchmarks show geometry-aware merging offers no advantage over Euclidean averaging, indicating adapter interference stems from nonlinear representation interactions rather than parameter-space geometry.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06735","ref_index":27,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Geometric Account of Activation 
Steering through Angle-Norm Decomposition","primary_cat":"cs.AI","submitted_at":"2026-06-04T21:42:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Empirical study across seven language models finds concepts represented primarily in angular structure of activations while norm affects steering stability, recommending separate angular and radial parameterization over single additive coefficients.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05016","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging","primary_cat":"cs.CL","submitted_at":"2026-06-03T15:39:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TaDA merges task-domain LoRAs via calibrated per-layer gating and subspace-aware merging, reaching 0.452 avg accuracy on six scientific QA benchmarks and 85.9% on six image classification benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05241","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation","primary_cat":"cs.CR","submitted_at":"2026-06-03T07:11:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Deep research agents exhibit widespread search-time contamination on six public benchmarks, with three defined leakage types inflating performance by up to 
4%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04522","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ANN Search: Recall What Matters","primary_cat":"cs.IR","submitted_at":"2026-06-03T07:00:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ANN search quality is better assessed by 1/Ratio@k than Recall@k because the former tracks downstream task utility more closely while allowing substantially lower computational cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04127","ref_index":35,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG","primary_cat":"cs.CL","submitted_at":"2026-06-02T18:34:54+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Large-scale evaluation shows retrieval-augmented generation yields only marginal and inconsistent gains (1-2 points) over no-retrieval baselines in biomedical QA, with model choice dominating retriever or corpus effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01961","ref_index":35,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AutoMedBench: Towards Medical AutoResearch with Agentic AI Models","primary_cat":"cs.AI","submitted_at":"2026-06-01T09:22:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of 
runs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00686","ref_index":43,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing","primary_cat":"cs.LG","submitted_at":"2026-05-30T11:49:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29473","ref_index":26,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles","primary_cat":"cs.HC","submitted_at":"2026-05-28T07:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22963","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Graph Alignment Topology as an Inductive Bias for Grounding Detection","primary_cat":"cs.CL","submitted_at":"2026-05-21T18:49:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A GNN trained on bipartite alignment graphs between references and LLM generations reports state-of-the-art hallucination detection across four datasets, beating prior methods 
and GPT-4o.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22734","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-21T17:04:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChronoMedKG builds a temporal biomedical KG with 460k evidence-linked triples across 13k diseases using LLM consensus and introduces the ChronoTQA benchmark showing RAG gains on time-sensitive questions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22080","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation","primary_cat":"cs.CV","submitted_at":"2026-05-21T07:20:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21413","ref_index":16,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work","primary_cat":"cs.AI","submitted_at":"2026-05-20T17:09:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 
57.58%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11533","ref_index":10,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation","primary_cat":"cs.CL","submitted_at":"2026-05-12T04:58:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01048","ref_index":123,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Compared to What? Baselines and Metrics for Counterfactual Prompting","primary_cat":"cs.CL","submitted_at":"2026-05-01T19:23:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26048","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets","primary_cat":"cs.CL","submitted_at":"2026-04-28T18:33:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21304","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs","primary_cat":"cs.IR","submitted_at":"2026-04-23T05:42:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"gathering. (5) Inaccurate-Query-Keywords refers to cases where the model generates ineffective, overly generic, or misdirected retrieval queries. As a result, the retrieval process repeatedly returns irrelevant, uninformative, or empty results, indicating that the model fails to translate the information need into precise and actionable queries. (6) Shallow-Evidence-Integration arises when the model successfully retrieves relevant evidence but fails to integrate it into a coherent and well-justified answer. 
In these cases, the retrieved observations contain the necessary information, yet the final prediction does not adequately combine, explain, or reason over the evidence, falling short of the depth required by the ground-truth label."},{"citing_arxiv_id":"2604.06650","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP","primary_cat":"cs.CL","submitted_at":"2026-04-08T03:52:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A shared metaprompt distilled from 21 clinical tasks enables adaptation to 10 held-out datasets across five task types with under 0.05% parameters, outperforming LoRA by 1.5-1.7% and single-task prompt tuning by 6.1-6.6%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.28325","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Building evidence-based knowledge bases from full-text literature for disease-specific biomedical reasoning","primary_cat":"cs.CE","submitted_at":"2026-03-30T11:53:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvidenceNet releases disease-specific biomedical knowledge bases with 7,872 and 6,622 evidence records for HCC and CRC, plus graphs, extracted via LLM pipeline with reported high fidelity.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"3 External reasoning utility validation We next evaluate whether EvidenceNet can support question answering beyond facts directly instantiated in the graph. 
For this purpose, we assemble an external yes/no benchmark by filtering HCC-related (98 samples) and CRC-related (93 samples) question-answering instances from three public biomedical QA resources, namely PubMedQA [51], BioASQ [52], and Evidence-Inference [53]. This setting is more demanding than the internal QA task because the questions are not generated from EvidenceNet itself and therefore require semantic generalization rather than direct recovery of graph-native statements. The same answering protocol and metrics defined in equation (7) are used for this evaluation."},{"citing_arxiv_id":"2603.05308","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution","primary_cat":"cs.CL","submitted_at":"2026-03-05T15:48:43+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12805","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding","primary_cat":"q-bio.GN","submitted_at":"2026-01-19T08:06:35+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24276","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge","primary_cat":"cs.AI","submitted_at":"2025-09-29T04:38:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"G-reasoner uses QuadGraph abstraction and a 
34M-parameter graph foundation model integrated with LLMs to enable scalable reasoning over diverse graph-structured knowledge, outperforming baselines on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.14427","ref_index":27,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-02-20T10:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.10692","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-08-20T09:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11794","ref_index":91,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DataComp-LM: In search of the next generation of training sets for language 
models","primary_cat":"cs.LG","submitted_at":"2024-06-17T17:42:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Questions - kaggle.com. https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions, 2019. [90] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ArXiv preprint, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361. [91] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. 
[92] Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del"},{"citing_arxiv_id":"2402.03216","ref_index":100,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation","primary_cat":"cs.CL","submitted_at":"2024-02-05T17:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.11156","ref_index":108,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Can AI-Generated Text be Reliably Detected?","primary_cat":"cs.CL","submitted_at":"2023-03-17T17:53:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}