ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
super hub Mixed citations
Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks
Mixed citation behavior. Most common role is background (54%).
hub tools
citation-role summary
citation-polarity summary
authors
co-cited works
representative citing papers
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.
Retrieval coverage limits LLM rerankers in cold-start recommendation; a learned hybrid fusion improves pool quality but LLM reranking often degrades end-to-end performance while simpler rankers exploit the pool.
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
Introduces P-CHR AUC and CRR metrics to demonstrate that semantic caching model selection is limited by calibration quality rather than ranking performance.
DICE aggregates independently encoded document chunks into a single vector to reduce evidence dilution in long-document dense retrieval, reporting gains on LongEmbed especially beyond 4k tokens.
BELLS-O is the first vendor-neutral operational benchmark comparing specialized guardrails and repurposed frontier LLMs on accuracy, false-positive rates, speed, and monetary cost across 11 harm categories and 13 jailbreak techniques.
Unsupervised style representations learned via paraphrase inversion enable competitive few-shot and zero-shot AI-text detection with better generalization to unseen LLMs than supervised baselines.
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
CaLIR learns continuous latent intent states guided by product category hierarchies for generative retrieval, combining hierarchical reasoning and dynamic prefix tries to balance effectiveness and low-latency inference on multilingual e-commerce data.
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video generation workflows using structured checkpoints and multi-agent evaluation.
Formalizes continual model routing (CMR), releases CMRBench with over 2000 models, and presents CARvE which outperforms retrieval, fine-tuning and adapter-merging baselines on model/family/domain accuracy.
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 15-case benchmark.
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
Large-scale analysis of unfiltered search queries shows geospatial intent at 18% of total, dominated by transactional categories outside traditional GIS scope.
Semantic search retrieves substantially more implicit receptions of Locke's work than lexical baselines in 18th-century corpora, yet remains constrained by lexical gatekeeping.
citing papers explorer
-
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture
-
Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis
Safety Context Injection prepends structured external risk reports via static or agentic analysis to lower attack success rates and toxicity in reasoning models on AdvBench and GPTFuzz benchmarks.
-
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
-
Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation
Retina-RAG combines a retinal classifier, LoRA-tuned Qwen2.5-VL, and RAG to jointly grade DR, detect ME, and generate reports, reaching F1 scores of 0.731 and 0.948 while exceeding baselines on ROUGE-L and SBERT metrics.