IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
Late chunking: Contextual chunk embeddings using long-context embedding models
7 Pith papers cite this work.
citation-role summary
citation-polarity summary
roles: background 1
polarities: background 1
representative citing papers
ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.
Empirical study claiming to be the first broad comparison of chunking methods in RAG, highlighting effectiveness, cost, and generalization limitations across scenarios.
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.
citing papers explorer
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.