ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
super hub Mixed citations
Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks
Mixed citation behavior. Most common role is background (54%).
hub tools
citation-role summary
citation-polarity summary
authors
co-cited works
representative citing papers
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
Introduces P-CHR AUC and CRR metrics to demonstrate that semantic caching model selection is limited by calibration quality rather than ranking performance.
DICE aggregates independently encoded document chunks into a single vector to reduce evidence dilution in long-document dense retrieval, reporting gains on LongEmbed especially beyond 4k tokens.
BELLS-O is the first vendor-neutral operational benchmark comparing specialized guardrails and repurposed frontier LLMs on accuracy, false-positive rates, speed, and monetary cost across 11 harm categories and 13 jailbreak techniques.
Unsupervised style representations learned via paraphrase inversion enable competitive few-shot and zero-shot AI-text detection with better generalization to unseen LLMs than supervised baselines.
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
CaLIR learns continuous latent intent states guided by product category hierarchies for generative retrieval, combining hierarchical reasoning and dynamic prefix tries to balance effectiveness and low-latency inference on multilingual e-commerce data.
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 15-case benchmark.
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
Semantic search retrieves substantially more implicit receptions of Locke's work than lexical baselines in 18th-century corpora, yet remains constrained by lexical gatekeeping.
Test-time scaling for personalized LLMs follows a logarithmic utility curve under oracle selection but standard reward models suffer user-level collapse and query-level hacking; a probabilistic reward model with learned variance enables consistent scaling.
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that keeps vocabulary size fixed at 512.
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
Cross-cultural survey of 4,641 participants shows LLM emotional support adoption varies widely by country and demographics, with socioeconomic status as strongest predictor of trust and use, and English-speaking nations more accepting than others in Europe.
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
citing papers explorer
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
SimCSE: Simple Contrastive Learning of Sentence Embeddings
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
-
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
-
Closing the Calibration Gap in Semantic Caching
Introduces P-CHR AUC and CRR metrics to demonstrate that semantic caching model selection is limited by calibration quality rather than ranking performance.
-
Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
DICE aggregates independently encoded document chunks into a single vector to reduce evidence dilution in long-document dense retrieval, reporting gains on LongEmbed especially beyond 4k tokens.
-
BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems
BELLS-O is the first vendor-neutral operational benchmark comparing specialized guardrails and repurposed frontier LLMs on accuracy, false-positive rates, speed, and monetary cost across 11 harm categories and 13 jailbreak techniques.
-
Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion
Unsupervised style representations learned via paraphrase inversion enable competitive few-shot and zero-shot AI-text detection with better generalization to unseen LLMs than supervised baselines.
-
Continuous Language Diffusion as a Decoder-Interface Problem
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
-
Beyond Matching: Category-Guided Latent Intent Reasoning for Generative Retrieval in E-Commerce
CaLIR learns continuous latent intent states guided by product category hierarchies for generative retrieval, combining hierarchical reasoning and dynamic prefix tries to balance effectiveness and low-latency inference on multilingual e-commerce data.
-
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
-
SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
-
EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision
EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.
-
Low-Resource Safety Failures Are Action Failures, Not Representation Failures
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
-
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
-
SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities
SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 15-case benchmark.
-
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
-
Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke
Semantic search retrieves substantially more implicit receptions of Locke's work than lexical baselines in 18th-century corpora, yet remains constrained by lexical gatekeeping.
-
Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
Test-time scaling for personalized LLMs follows a logarithmic utility curve under oracle selection but standard reward models suffer user-level collapse and query-level hacking; a probabilistic reward model with learned variance enables consistent scaling.
-
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
-
CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt
CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that keeps vocabulary size fixed at 512.
-
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support
Cross-cultural survey of 4,641 participants shows LLM emotional support adoption varies widely by country and demographics, with socioeconomic status as strongest predictor of trust and use, and English-speaking nations more accepting than others in Europe.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
-
Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expansion insufficient to fix it.
-
Bounded Autonomy: Controlling LLM Characters in Live Multiplayer Games
Bounded autonomy is a new control architecture that makes LLM characters workable in live multiplayer games by combining interaction stability techniques, action grounding, and lightweight player steering, validated through deployment and analysis.
-
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
-
XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception
The XNote dataset and LVLM benchmarks demonstrate that current models face significant challenges in generating accurate, grounded Community Notes for image-based contextual deception.
-
Spectral Tempering for Embedding Compression in Dense Passage Retrieval
Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.
-
Form Without Function: Agent Social Behavior in the Moltbook Network
In an AI-agent social network, the structural form of social media is fully present but genuine social functions like reciprocity and argumentation are largely absent.
-
Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala
Language models degrade over 300 times in performance on Romanized Sinhala versus Unicode, with model size showing no correlation to script robustness.
-
Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
-
From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors
Crowdsourced metaphors show rising anthropomorphism and warmth toward AI that predict trust and adoption, with notable demographic differences.
-
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
-
Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA
Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.
-
Agentic Analysis for Agentic Infrastructure: An LLM-Powered Pipeline for Comparative Governance of DAO and Corporate AI Protocols
An LLM-assisted pipeline applied to 4,323 governance records finds that permissionless and corporate AI agent protocols show similar participation inequality and fragmentation but denser thematic alignment in the open setting.
-
A Grounded Evidence-Retrieval Benchmark and Hybrid RAG Framework for Silicon Pixel Detector R&D
Presents the first evidence-grounded retrieval benchmark and hybrid RAG framework for silicon pixel detector R&D, with evaluation showing hybrid sparse-dense retrieval most reliable for evidence recovery.
-
What Do Language Priors Contribute to Darcy-Flow Inversion? A Mechanistic Audit
Sentence embeddings reduce reconstruction error by 81% in Darcy-flow inversion by providing categorical geological constraints, with limited added value from within-class text detail.
-
CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression
Input compression increases net cost and reduces accuracy as models respond longer; output compression reduces cost on most models, with surface text diverging from unconstrained references on non-reasoning models.
-
Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior
Data-similarity and data-influence produce significantly overlapping rankings of training documents for LLM outputs, with asymmetry allowing a favorable cost-accuracy trade-off.
-
Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
-
SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG
SproutRAG introduces an attention-guided hierarchical framework that constructs a binary chunking tree for multi-granularity retrieval in RAG systems and reports a 6.1% average gain in information efficiency.
-
MindAlign: Decoding Inner Speech from fMRI Signals via Multimodal Embedding Alignment under Limited Data
MindAlign decodes inner speech from fMRI via subject-specific neural-semantic alignment into a multimodal space followed by prompting of a frozen LM, outperforming baselines and generalizing across subjects.
-
Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR
Introduces ShopTrajQA long-context benchmark and an RLVR-trained tool-augmented agent that bypasses LLM context limits by external file storage and code-based retrieval for shopping trajectories.
-
Empirical Study on the Characteristics and Evolution of AI-usage in GitHub Repositories: Evidence from Code Comments
Analysis of 35k AI-referencing GitHub comments shows primary use for code implementation, with evolution toward conceptual support and sustained human refinement over time.
-
Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions
Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.
-
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
-
Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models
Clustered Self-Assessment groups sampled LLM responses into semantic clusters, presents clusters as multiple-choice options, and uses the LLM's assigned probabilities to those options as direct uncertainty estimates, outperforming entropy baselines with as few as two extra samples.
-
GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations
GLINT introduces sparsely gated alignment and dense feature regularization on top of DINOv3 and V-JEPA encoders to enable query-specific zero-shot grounding and segmentation in 2D CXR and 3D CT.