pith. machine review for the scientific record.

arxiv: 1908.10084 · v1 · submitted 2019-08-27 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 14:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords sentence embeddings · BERT · siamese networks · triplet networks · semantic textual similarity · cosine similarity · transfer learning

The pith

Sentence-BERT uses siamese and triplet training on BERT to create fixed sentence embeddings that support fast cosine-similarity comparisons while matching the original BERT model's accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sentence-BERT by adapting pretrained BERT with siamese and triplet network structures. This produces standalone sentence embeddings that capture semantic meaning and can be compared directly with cosine similarity. The change removes the need to run both sentences through the model together for each comparison. As a result, finding the most similar pair among 10,000 sentences drops from roughly 65 hours of BERT inference to about 5 seconds. A sympathetic reader cares because the approach makes semantic search and clustering practical at scale while preserving BERT-level performance on standard sentence-pair tasks.
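The mechanics of that speedup are simple: encode each sentence once, then compare vectors directly. A minimal numpy sketch of the comparison step (the random vectors below stand in for real SBERT embeddings):

```python
import numpy as np

def most_similar_pair(embeddings: np.ndarray) -> tuple[int, int]:
    """Return the indices of the most cosine-similar pair among n embeddings."""
    # Normalize rows so plain dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # (n, n) cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)   # ignore each sentence's self-similarity
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return int(i), int(j)

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 768))   # stand-ins for 768-dim sentence embeddings
pair = most_similar_pair(vecs)
```

With precomputed embeddings the n(n-1)/2 comparisons collapse into one matrix multiply, which is what turns hours of transformer inference into seconds.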

Core claim

Sentence-BERT modifies the pretrained BERT network by applying siamese and triplet network structures to derive semantically meaningful sentence embeddings. These embeddings can be compared using cosine similarity. The method reduces the computational cost of finding the most similar pair in a collection of 10,000 sentences from approximately 50 million inference computations (about 65 hours) with BERT to about 5 seconds with SBERT, while maintaining the accuracy achieved by the original BERT model on semantic textual similarity tasks.
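The ~50 million figure is just the pairwise count: a cross-encoder must run once per candidate pair, so n sentences need n(n-1)/2 forward passes. A quick check of the paper's arithmetic:

```python
# Cross-encoder BERT: every candidate pair needs its own forward pass.
n = 10_000
pairwise_inferences = n * (n - 1) // 2   # 49,995,000 -- the paper's ~50 million

# Bi-encoder SBERT: the transformer runs once per sentence; the
# n(n-1)/2 comparisons afterwards are cheap vector operations.
encoder_passes = n
```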

What carries the argument

Siamese and triplet network structures applied to BERT for producing standalone sentence embeddings.
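The triplet half of that training signal is the standard margin loss the paper describes: pull an anchor sentence's embedding toward a positive example and away from a negative one. A hedged numpy sketch (the margin of 1 follows the paper; the vectors here are placeholders, not real encoder outputs):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(||a - p|| - ||a - n|| + margin, 0), Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])
p = np.array([1.0, 0.1])   # close to the anchor -> small positive distance
n = np.array([-1.0, 0.0])  # far from the anchor -> large negative distance
loss = triplet_loss(a, p, n)  # 0.0: the negative is already a margin further away
```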

If this is right

  • Semantic similarity search over large sentence collections becomes feasible in seconds rather than hours.
  • Unsupervised tasks such as clustering become practical with BERT-derived embeddings.
  • SBERT and SRoBERTa outperform prior state-of-the-art sentence embedding methods on standard STS benchmarks and transfer learning tasks.
  • The same accuracy as full BERT pairwise inference is retained on sentence-pair regression tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same siamese training approach could be applied to other pretrained transformer models to generate efficient embeddings.
  • Independent sentence embeddings may serve as a practical approximation for many semantic comparison tasks that originally required joint inference.
  • Combining SBERT-style embeddings with domain-specific fine-tuning could further improve performance on specialized corpora without reintroducing pairwise computation costs.

Load-bearing premise

Fine-tuning BERT with siamese and triplet networks produces sentence embeddings whose cosine similarities accurately reflect semantic similarity at the level of the original pairwise BERT inference.

What would settle it

A held-out semantic textual similarity dataset where the ranking of sentence pairs by SBERT cosine similarity differs substantially from the ranking obtained by direct BERT pairwise inference on the same pairs.
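Concretely, that test reduces to a rank comparison, typically reported as Spearman's ρ. A minimal sketch with made-up scores standing in for actual SBERT cosine similarities and BERT pairwise predictions:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Toy scores standing in for the two models' judgments on five sentence pairs
sbert_scores = np.array([0.91, 0.15, 0.55, 0.72, 0.33])
bert_scores  = np.array([0.88, 0.10, 0.60, 0.70, 0.40])
rho = spearman(sbert_scores, bert_scores)  # 1.0 here: the rankings agree exactly
```

A substantially lower ρ on a held-out STS set would be the kind of divergence described above.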

read the original abstract

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, it requires that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embeddings methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sentence-BERT (SBERT), a modification of pre-trained BERT (and RoBERTa) that employs siamese and triplet network structures to produce fixed-length sentence embeddings. These embeddings can be compared efficiently via cosine similarity, reducing the cost of finding the most similar pair among 10,000 sentences from ~65 hours (pairwise BERT inference) to ~5 seconds while claiming to maintain BERT-level accuracy on semantic textual similarity (STS) tasks and to outperform prior sentence embedding methods on transfer learning tasks.

Significance. If the empirical claims hold, the work is significant because it makes contextualized transformer representations practical for large-scale semantic search, clustering, and retrieval pipelines that were previously infeasible due to quadratic inference costs. The approach has influenced subsequent efficient embedding research and provides a reproducible recipe for adapting pre-trained models to standalone sentence encoding.

major comments (2)
  1. [§3] §3 (SBERT Architecture): The central claim that siamese/triplet fine-tuning on NLI data produces embeddings whose cosine similarities recover the semantic judgments of BERT's joint [CLS] encoding is load-bearing for the 'maintaining the accuracy' assertion, yet the manuscript provides no ablation or diagnostic test on phenomena that rely on cross-sentence attention (e.g., negation scope, coreference resolution, or subtle entailment). A controlled comparison on such cases would be required to substantiate that independent encoding plus learned pooling fully compensates for the removed token-level interactions.
  2. [§4] §4 (Evaluation): The STS and transfer-task results are presented without reporting run-to-run variance, statistical significance tests, or direct side-by-side numbers for the original BERT/RoBERTa pairwise baseline on the identical splits and metrics; this weakens the quantitative support for the efficiency-accuracy tradeoff claim.
minor comments (2)
  1. [Abstract] Abstract: subject-verb agreement error ('BERT and RoBERTa has set') and subject-verb mismatch ('that use siamese').
  2. [§3] Notation: the pooling operation (mean/max/[CLS]) and the exact form of the triplet loss are described but not given explicit equations; adding numbered equations would improve reproducibility.
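For reference, the mean pooling that comment names (the paper's default strategy) averages token embeddings over non-padding positions; a hedged numpy sketch:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions.

    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(float)          # (seq_len, 1)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

tokens = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])  # last row is padding
mask = np.array([1, 1, 0])
sentence_vec = mean_pool(tokens, mask)  # [2.0, 3.0]
```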

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below, clarifying our position and indicating changes to the manuscript where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (SBERT Architecture): The central claim that siamese/triplet fine-tuning on NLI data produces embeddings whose cosine similarities recover the semantic judgments of BERT's joint [CLS] encoding is load-bearing for the 'maintaining the accuracy' assertion, yet the manuscript provides no ablation or diagnostic test on phenomena that rely on cross-sentence attention (e.g., negation scope, coreference resolution, or subtle entailment). A controlled comparison on such cases would be required to substantiate that independent encoding plus learned pooling fully compensates for the removed token-level interactions.

    Authors: We agree that phenomena relying on cross-sentence attention represent an important test case for the claim that SBERT embeddings recover BERT-level semantic judgments. The NLI training data used for fine-tuning explicitly requires modeling entailment and contradiction relations, which frequently involve negation, coreference, and subtle semantic distinctions. The strong results on STS benchmarks, which contain many such examples, provide supporting evidence that the learned pooling and siamese objective capture the necessary information in the fixed embeddings. Nevertheless, the original manuscript did not include targeted diagnostic ablations or controlled comparisons isolating these phenomena. In the revised version we have added a paragraph in §3 discussing this point, along with qualitative examples illustrating SBERT's handling of negation and coreference in similarity tasks. A full controlled study would require new experiments outside the scope of the current work focused on efficient sentence encoding. revision: partial

  2. Referee: [§4] §4 (Evaluation): The STS and transfer-task results are presented without reporting run-to-run variance, statistical significance tests, or direct side-by-side numbers for the original BERT/RoBERTa pairwise baseline on the identical splits and metrics; this weakens the quantitative support for the efficiency-accuracy tradeoff claim.

    Authors: The BERT and RoBERTa pairwise numbers reported in the paper are taken directly from the same standard STS and transfer-task benchmarks and splits used in the original BERT/RoBERTa publications and subsequent leaderboard evaluations, enabling direct comparison on identical metrics. To strengthen the presentation, we have updated the evaluation section and tables to report run-to-run standard deviations (computed over five random seeds) for SBERT and SRoBERTa, and we have added paired statistical significance tests against the strongest baselines. The side-by-side BERT/RoBERTa figures already appear in Tables 1 and 2 using the same evaluation protocol. revision: yes

standing simulated objections not resolved
  • A dedicated controlled ablation isolating cross-sentence attention phenomena (negation scope, coreference, subtle entailment) was not performed in the original experiments.

Circularity Check

0 steps flagged

No circularity: SBERT is an empirical fine-tuning method with external validation

full rationale

The paper describes a practical modification of BERT using siamese and triplet networks to produce fixed sentence embeddings for cosine similarity, followed by direct evaluation on STS and transfer tasks. No derivation chain exists that reduces a claimed result to its own inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or imported uniqueness theorems appear. The core claim rests on standard transfer learning from an external pre-trained model (BERT) and is tested against independent benchmarks, making the procedure self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that siamese and triplet fine-tuning successfully transfers BERT's semantic capabilities to independent sentence embeddings; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption BERT's pre-trained representations can be adapted via siamese and triplet training to produce standalone sentence embeddings that preserve semantic information.
    This is the core unproven premise enabling the efficiency gain.

pith-pipeline@v0.9.0 · 5496 in / 1182 out tokens · 70372 ms · 2026-05-10T14:47:49.156617+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Unified Geometric Framework for Weighted Contrastive Learning

    cs.LG 2026-05 unverdicted novelty 8.0

    Weighted InfoNCE objectives realize specific target geometries in embedding space, with SupCon producing size-dependent inter-class similarities under imbalance while Soft SupCon and certain continuous variants preser...

  2. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  3. Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

    cs.CR 2026-04 conditional novelty 8.0

    Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

  4. PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users

    cs.CL 2026-05 unverdicted novelty 7.0

    Preference fine-tuning outperforms prompting for personalisation but amplifies sycophancy and relationship-seeking, while simulated users recover aggregate rankings yet show far lower self-consistency and different to...

  5. DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

    cs.AI 2026-05 unverdicted novelty 7.0

    DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...

  6. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 7.0

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  7. Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs

    cs.CR 2026-05 unverdicted novelty 7.0

    PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.

  8. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  9. Automated Large-scale CVRP Solver Design via LLM-assisted Flexible MCTS

    cs.AI 2026-05 unverdicted novelty 7.0

    LaF-MCTS uses LLM-assisted flexible MCTS with a three-tier hierarchy, semantic pruning, and branch regrowth to automatically compose decomposition-enhanced CVRP solvers that outperform state-of-the-art methods on CVRP...

  10. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  11. RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

    cs.SE 2026-04 unverdicted novelty 7.0

    RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...

  12. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  13. Similar Users-Augmented Interest Network

    cs.IR 2026-04 unverdicted novelty 7.0

    SUIN improves CTR prediction by augmenting target user sequences with similar users' behaviors via embedding-based retrieval, user-specific position encoding, and user-aware target attention.

  14. Prompt-Unknown Promotion Attacks against LLM-based Sequential Recommender Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    PUDA enables effective promotion of unpopular target items in black-box LLM sequential recommenders by using evolutionary LLM refinement to infer hidden prompts, training a surrogate model, and combining adversarial t...

  15. R2Code: A Self-Reflective LLM Framework for Requirements-to-Code Traceability

    cs.SE 2026-04 unverdicted novelty 7.0

    R2Code improves requirement-to-code traceability with a bidirectional alignment network, self-reflective consistency verification, and dynamic context-adaptive retrieval, yielding 7.4% average F1 gain and up to 41.7% ...

  16. Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation

    cs.IR 2026-04 unverdicted novelty 7.0

    An LLM simulation framework generates multilingual tip-of-the-tongue queries, validated by rank correlation with real queries, producing the first large-scale ToT benchmarks for four languages.

  17. Semantic Recall for Vector Search

    cs.IR 2026-04 unverdicted novelty 7.0

    Semantic Recall is a new evaluation metric for approximate nearest neighbor search that focuses only on semantically relevant results, with Tolerant Recall as a proxy when relevance labels are unavailable.

  18. HumanScore: Benchmarking Human Motions in Generated Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps betwee...

  19. LLM-Viterbi: Semantic-Aware Decoding for Convolutional Codes

    cs.IT 2026-04 unverdicted novelty 7.0

    An LLM-enhanced Viterbi decoder achieves roughly 1.5 dB extra coding gain in block error rate and over 50% better semantic similarity than conventional Viterbi for constraint-length-3 convolutional codes on AWGN channels.

  20. DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion

    cs.IR 2026-04 conditional novelty 7.0

    Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.

  21. Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    BAGEL is a Bayesian active learning framework that uses Gaussian Processes to propagate LLM relevance signals across embedding space and guide global exploration, outperforming standard LLM reranking under identical b...

  22. mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

  23. Efficient Personalization of Generative User Interfaces

    cs.LG 2026-04 unverdicted novelty 7.0

    A dataset revealing high inter-designer disagreement on UI preferences motivates a sample-efficient method that personalizes generative interfaces by embedding new users in the space of prior designers, outperforming ...

  24. Skill-Conditioned Visual Geolocation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.

  25. Skill-Conditioned Visual Geolocation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...

  26. Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

    cs.CR 2026-04 unverdicted novelty 7.0

    HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.

  27. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    cs.CL 2024-02 unverdicted novelty 7.0

    M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...

  28. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  29. Steering Language Models With Activation Engineering

    cs.CL 2023-08 unverdicted novelty 7.0

    Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

  30. Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.

  31. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

    cs.AI 2026-05 unverdicted novelty 6.0

    SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

  32. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  33. Sanity Checks for Long-Form Hallucination Detection

    cs.CL 2026-05 unverdicted novelty 6.0

    Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.

  34. WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.

  35. Structural Rationale Distillation via Reasoning Space Compression

    cs.CL 2026-05 unverdicted novelty 6.0

    D-RPC compresses reasoning into a dynamic bank of reusable paths to produce consistent teacher rationales, outperforming standard distillation baselines on five reasoning benchmarks while using fewer tokens.

  36. RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.

  37. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  38. On the Role of Language Representations in Auto-Bidding: Findings and Implications

    cs.AI 2026-05 unverdicted novelty 6.0

    SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and ...

  39. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 6.0

    PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...

  40. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  41. Anticipating Innovation Using Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TechToken uses transformer embeddings of IPC codes to measure linguistic convergence in patents and predict future technological combinations.

  42. Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding

    cs.CL 2026-05 unverdicted novelty 6.0

    GTokenLLMs do not fully understand graph tokens, exhibiting over-sensitivity or insensitivity to instruction changes and relying heavily on text for reasoning even when graph information is preserved.

  43. RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions

    cs.SE 2026-05 unverdicted novelty 6.0

    RECAP captures, replays, and analyzes AI-assisted programming sessions by linking prompts, edits, and developer actions in a single timeline.

  44. A Replicability Study of XTR

    cs.IR 2026-05 accept novelty 6.0

    XTR training does not improve retrieval effectiveness over ColBERT but enhances IVF engine efficiency by flattening token scores to produce more discriminative centroids.

  45. From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

    cs.AI 2026-04 unverdicted novelty 6.0

    Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

  46. Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text

    cs.IR 2026-04 unverdicted novelty 6.0

    Methods for constructing Hypergraphs of Text are proposed with a new effort ratio metric where TF-IDF baselines match LLM methods in experiments.

  47. LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images

    cs.CV 2026-04 unverdicted novelty 6.0

    LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than ca...

  48. MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

    cs.CL 2026-04 unverdicted novelty 6.0

    MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance especially at extreme lo...

  49. When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.

  50. COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

  51. Text Steganography with Dynamic Codebook and Multimodal Large Language Model

    cs.CR 2026-04 unverdicted novelty 6.0

    A black-box text steganography method using a dynamic codebook generated by multimodal LLMs and reject-sampling feedback achieves higher embedding capacity and text quality than prior white-box and fixed-codebook blac...

  52. Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs show mixed results on authorship verification, post generation, and attribute inference from Twitter data, with new frameworks and user studies establishing benchmarks for these analytics tasks.

  53. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

  54. HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...

  55. Identifying Ethical Biases in Action Recognition Models

    cs.CV 2026-04 unverdicted novelty 6.0

    The authors create a synthetic video auditing framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.

  56. DuConTE: Dual-Granularity Text Encoder with Topology-Constrained Attention for Text-attributed Graphs

    cs.CL 2026-04 unverdicted novelty 6.0

    DuConTE is a dual-granularity text encoder that incorporates graph topology into language model attention for improved node representations in text-attributed graphs.

  57. REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning

    cs.CL 2026-04 unverdicted novelty 6.0

    REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.

  58. Lorentz Framework for Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3...

  59. UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels

    cs.LG 2026-04 unverdicted novelty 6.0

    UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.

  60. Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

    cs.LG 2026-04 unverdicted novelty 6.0

    RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 111 Pith papers · 3 internal anchors

  1. [1]

    Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. http://www.aclweb.org/anthology/S15-2045 SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Procee...

  2. [2]

    Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. https://doi.org/10.3115/v1/S14-2010 SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages ...

  3. [3]

    Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. http://aclweb.org/anthology/S/S16/S16-1081.pdf SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@...

  4. [4]

    Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. https://www.aclweb.org/anthology/S13-1004 *SEM 2013 shared task: Semantic Textual Similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 3...

  5. [5]

    Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. http://dl.acm.org/citation.cfm?id=2387636.2387697 SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedin...

  6. [6]

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1075 A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632--642, Lisbon, Portugal. Association for Computational Linguistics

  7. [7]

    Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. http://arxiv.org/abs/1708.00055 SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1--14, Vancouver, Canada

  8. [8]

    Daniel Cer, Yinfei Yang, Sheng-Yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. http://arxiv.org/abs/1803.11175 Universal Sentence Encoder. arXiv preprint arXiv:1803.11175

  9. [9]

    Alexis Conneau and Douwe Kiela. 2018. https://arxiv.org/abs/1803.05449 SentEval: An Evaluation Toolkit for Universal Sentence Representations. arXiv preprint arXiv:1803.05449

  10. [10]

    Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. https://www.aclweb.org/anthology/D17-1070 Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670--680, Copenhagen, Denmark. ...

  11. [11]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. https://arxiv.org/abs/1810.04805 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805

  12. [12]

    Bill Dolan, Chris Quirk, and Chris Brockett. 2004. https://doi.org/10.3115/1220355.1220406 Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA. Association for Computational Linguistics

  13. [13]

    Liat Ein-Dor, Yosi Mass, Alon Halfon, Elad Venezian, Ilya Shnayderman, Ranit Aharonov, and Noam Slonim. 2018. https://doi.org/10.18653/v1/P18-2009 Learning Thematic Similarity Metric from Article Sections Using Triplet Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 49--...

  14. [14]

    Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. https://doi.org/10.18653/v1/N16-1162 Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367--1377, San Diego, California. Assoc...

  15. [15]

    Minqing Hu and Bing Liu. 2004. https://doi.org/10.1145/1014052.1014073 Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168--177, New York, NY, USA. ACM

  16. [16]

    Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. http://arxiv.org/abs/1905.01969 Real-time Inference in Multi-sentence Tasks with Deep Pretrained Transformers. arXiv preprint arXiv:1905.01969

  17. [17]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. https://arxiv.org/abs/1702.08734 Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734

  18. [18]

    Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf Skip-Thought Vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294--3302. Curra...

  19. [19]

    Xin Li and Dan Roth. 2002. https://doi.org/10.3115/1072228.1072378 Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1--7, Stroudsburg, PA, USA. Association for Computational Linguistics

  20. [20]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. http://arxiv.org/abs/1907.11692 RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692

  21. [21]

    Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC '...

  22. [22]

    Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. http://arxiv.org/abs/1903.10561 On Measuring Social Biases in Sentence Encoders. arXiv preprint arXiv:1903.10561

  23. [23]

    Amita Misra, Brian Ecker, and Marilyn A. Walker. 2016. http://aclweb.org/anthology/W/W16/W16-3636.pdf Measuring the Similarity of Sentential Arguments in Dialogue. In Proceedings of the SIGDIAL 2016 Conference, The 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 13-15 September 2016, Los Angeles, CA, USA, pages 276--287

  24. [24]

    Bo Pang and Lillian Lee. 2004. https://doi.org/10.3115/1218955.1218990 A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL '04), Main Volume, pages 271--278, Barcelona, Spain

  25. [25]

    Bo Pang and Lillian Lee. 2005. https://doi.org/10.3115/1219840.1219855 Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 115--124, Ann Arbor, Michigan. Association for Computational Linguistics

  26. [26]

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. https://www.aclweb.org/anthology/D14-1162 GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543

  27. [27]

    Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2019. http://arxiv.org/abs/1904.07531 Understanding the Behaviors of BERT in Ranking. arXiv preprint arXiv:1904.07531

  28. [28]

    Nils Reimers, Philip Beyer, and Iryna Gurevych. 2016. https://www.aclweb.org/anthology/C16-1009 Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity. In Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 87--96

  29. [29]

    Nils Reimers and Iryna Gurevych. 2018. http://arxiv.org/abs/1803.09578 Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches. arXiv preprint arXiv:1803.09578

  30. [30]

    Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. https://www.aclweb.org/anthology/P19-1054 Classification and Clustering of Arguments with Contextualized Word Embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567--578, Florence, Italy....

  31. [31]

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. http://arxiv.org/abs/1503.03832 FaceNet: A Unified Embedding for Face Recognition and Clustering. arXiv preprint arXiv:1503.03832

  32. [32]

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. https://www.aclweb.org/anthology/D13-1170 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642, Seattle...

  33. [33]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information P...

  34. [34]

    Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. https://doi.org/10.1007/s10579-005-7880-9 Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 39(2):165--210

  35. [35]

    Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. http://aclweb.org/anthology/N18-1101 A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112--...

  36. [36]

    Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-Yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. https://www.aclweb.org/anthology/W18-3022 Learning Semantic Textual Similarity from Conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164--174, Melbourne, Australia. A...

  37. [37]

    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. http://arxiv.org/abs/1906.08237 XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237

  38. [38]

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. http://arxiv.org/abs/1904.09675 BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675