{"total":23,"items":[{"citing_arxiv_id":"2606.19719","ref_index":19,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Closing the Calibration Gap in Semantic Caching","primary_cat":"cs.IR","submitted_at":"2026-06-18T02:34:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces P-CHR AUC and CRR metrics to demonstrate that semantic caching model selection is limited by calibration quality rather than ranking performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11316","ref_index":50,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Schützen: Evaluating LLM Safety in Bulgarian and German Contexts","primary_cat":"cs.CL","submitted_at":"2026-06-09T18:01:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02953","ref_index":180,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt","primary_cat":"cs.CL","submitted_at":"2026-06-01T23:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Larger LLMs reproduce constructional productivity via entrenchment in coercion cases with nonce words but fail to use statistical preemption to avoid overgeneralizing semantically plausible but unobserved 
patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20052","ref_index":58,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling","primary_cat":"cs.CL","submitted_at":"2026-05-19T16:07:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PromptRad reformulates multi-label radiology report classification as masked language modeling and enriches verbalizers with UMLS synonyms, outperforming baselines with only 32 training examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12395","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles","primary_cat":"cs.CL","submitted_at":"2026-05-12T16:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We selected four datasets to test CTG techniques on. Two of these are commonly used free-text generation datasets: the prompts used in the evaluation of PPLM [ 13] which we call PPLM Prompts for short, and the OpenWebText neutral sentiment prompts [17]. We add two datasets used for Story Generation, namely the Cloze Winter 2018 test set [ 49] and the STS benchmark test set [7]. 
We used only the 'main captions' subset of the STS benchmark test set as these are general and can be used to produce texts with all of our control attributes. We use each dataset item to generate one text 2https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english 3https://huggingface.co/michelecafagna26/t5-base-finetuned-sst2-sentiment"},{"citing_arxiv_id":"2605.12345","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation","primary_cat":"cs.CL","submitted_at":"2026-05-12T16:21:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Summing outputs from separately trained QLoRA PEFT modules provides strong performance for attribute-controlled text generation, often matching or exceeding single-task modules even on single-attribute tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04901","ref_index":63,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference","primary_cat":"cs.CR","submitted_at":"2026-05-06T13:31:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00618","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Is Textual Similarity Invariant under Machine Translation? 
Evidence Based on the Political Manifesto Corpus","primary_cat":"cs.CL","submitted_at":"2026-05-01T12:41:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"on the Text-to-Text Transfer Transformer architecture and pre-trained on the mC4 dataset; 1024-dimensional output; • Nomic Embed 1.5 (nomic-ai/nomic-embed-text-v1.5), developed by Nussbaum et al. [47], based on long (2048 tokens) context length BERT; 768-dimensional output; MTEB Multilingual STS score: 59.45; • BGE-M3 (BAAI/bge-m3), developed by Chen et al. [15]; 1024-dimensional output; MTEB Multilingual STS score: 74.12. Each of those models is run both on original-language texts and on translated ones. 2.4 Segmentation Longer texts, such as political manifestos, typically cover multiple, diverse topics. 
When such texts are compared at the document-level, embeddings or other representations for distinct semantic topics be-"},{"citing_arxiv_id":"2604.19921","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding","primary_cat":"cs.CL","submitted_at":"2026-04-21T19:00:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Augmenting commonsense knowledge corpora with negation produces over 2M new triples that benefit LLM negation understanding when used for pre-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18835","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring","primary_cat":"cs.CL","submitted_at":"2026-04-20T20:59:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.02764","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-12-02T13:44:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for 
fine-tuning large autoregressive language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21285","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark","primary_cat":"cs.CL","submitted_at":"2025-11-26T11:18:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.18629","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"HyperAdapt: Simple High-Rank Adaptation","primary_cat":"cs.LG","submitted_at":"2025-09-23T04:29:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.16155","ref_index":71,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PRIMETIME : Limits of LLMs in Temporal Primitives","primary_cat":"cs.NE","submitted_at":"2025-04-22T17:52:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.19098","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation","primary_cat":"cs.LG","submitted_at":"2024-12-26T07:42:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SyMerge merges models via single-layer adaptation and expert-guided self-labeling to achieve task synergy, reporting SOTA results on vision, dense prediction, and NLP tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.05160","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks","primary_cat":"cs.CV","submitted_at":"2024-10-07T16:14:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.01119","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer","primary_cat":"cs.CL","submitted_at":"2024-08-02T09:00:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource 
initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of initialization seed on two model architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.17428","ref_index":148,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models","primary_cat":"cs.CL","submitted_at":"2024-05-27T17:59:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.03563","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning","primary_cat":"cs.CL","submitted_at":"2024-01-07T18:12:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Data-CUBE applies a two-level curriculum (TSP-based task ordering via simulated annealing plus difficulty-sorted mini-batches) to multi-task instruction tuning and reports gains on MTEB sentence representation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.14233","ref_index":169,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Enhancing Chat Language Models by Scaling High-quality Instructional 
Conversations","primary_cat":"cs.CL","submitted_at":"2023-05-23T16:49:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2106.09685","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LoRA: Low-Rank Adaptation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2021-06-17T17:37:18+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2104.08821","ref_index":72,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SimCSE: Simple Contrastive Learning of Sentence Embeddings","primary_cat":"cs.CL","submitted_at":"2021-04-18T11:27:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1905.00537","ref_index":88,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding 
Systems","primary_cat":"cs.CL","submitted_at":"2019-05-02T00:41:50+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}