SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
Canonical reference
Title resolution pending
Canonical reference. 80% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
representative citing papers
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
IdeaBlocks modularizes divergent intents into Exploration Blocks with multi-level reuse options, enabling 2.13 times more images explored and 12.5% greater visual diversity than baseline in a comparative user study.
ML4AVD research remains locked into binary function-level classification of C/C++ vulnerabilities because twelve pain points in the pipeline reinforce each other through feedback loops.
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
FIESTA uses bandit algorithms to adaptively decide how many seeds and splits to run for each candidate model, focusing effort on promising ones while providing guarantees on selecting the optimal model.
Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.
REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in extractable vs. conjunctive uses.
QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.
WM-SAR decomposes sarcasm into LLM-agent components, quantifies literal-normative inconsistency deterministically, and integrates it with intention via logistic regression to outperform prior sarcasm detectors on benchmarks.
Sparse Concept Anchoring biases neural latent spaces toward targeted concepts using under 0.1% labels per concept, enabling reversible steering via projection and permanent removal via weight ablation with minimal side effects on other features.
TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a factor of approximately 2.7.
Nomic AI produced and open-sourced a reproducible 8192-context English text embedder that exceeds OpenAI Ada-002 and text-embedding-3-small performance on MTEB short-context and LoCo long-context benchmarks.
Entity representations learned from text via link prediction generalize to unseen entities and transfer to classification and retrieval with reported gains of 22% MRR, 16% accuracy, and 8.8% NDCG@10.
A generative model of latent underlying punctuation in dependency trees, trained on incomplete data via local likelihood maximization, produces plausible reconstructions across languages and beats baselines on restoration.
End-to-end hard word clustering via Gumbel-Softmax produces smaller, parameter-efficient text classifiers.
SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
Compares LLMs against semantic similarity for binary classification of student self-explanations in programming education.
A new encoder-based SRL system with dependency-informed analysis delivers 10x faster inference and comparable or better F1 scores using BERT, RoBERTa, and DeBERTa while supporting multilingual projection.
New Zealand Reddit users link language to place and form contiguous speech communities with complex geographic alignment; Word2Vec embeddings reveal semantic variations and shifts in NZ English on a 4.26 billion word corpus.
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
Data-CUBE applies a two-level curriculum (TSP-based task ordering via simulated annealing plus difficulty-sorted mini-batches) to multi-task instruction tuning and reports gains on MTEB sentence representation tasks.