CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
5 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL 5
verdicts: UNVERDICTED 5
roles: dataset 1
polarities: use dataset 1
citing papers explorer
- SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia
  SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
  M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
- Text Embeddings by Weakly-Supervised Contrastive Pre-training
  E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.
- CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation
  Compact 0.8B-7B models for bidirectional Japanese-English translation outperform large multilingual models on real-world domain benchmarks.
- Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay
  Lius improves LLM translation for Kupang Malay by 4-13 points over baselines via continual instruction tuning with dictionary-derived instructions.