hub Mixed citations

Neural Machine Translation of Rare Words with Subword Units

Rico Sennrich, Barry Haddow, Alexandra Birch · 2015 · cs.CL · arXiv 1508.07909

Mixed citation behavior. Most common role is method (67%).

49 Pith papers citing it

Method 67% of classified citations

open full Pith review browse 49 citing papers arXiv PDF

abstract

Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 4 background 2

citation-polarity summary

use method 4 background 2

representative citing papers

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

cs.CL · 2026-03-29 · unverdicted · novelty 7.0

Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice the deviation and strong temperature sensitivity.

Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts

cs.LG · 2025-12-18 · unverdicted · novelty 7.0

The first HTR pipeline for Old Nepali manuscripts achieves 4.9% character error rate with released training code and scripts for low-resource historical scripts.

TokenChain: A Discrete Speech Chain via Semantic Token Modeling

eess.AS · 2025-10-07 · unverdicted · novelty 7.0

TokenChain demonstrates that a discrete semantic-token interface can sustain effective chain learning between ASR and TTS, yielding faster convergence and lower error rates on LibriSpeech and TED-LIUM.

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

cs.LG · 2024-08-28 · conditional · novelty 7.0

Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.

Chronos: Learning the Language of Time Series

cs.LG · 2024-03-12 · conditional · novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

cs.LG · 2022-08-15 · conditional · novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

Fine-Tuning Language Models from Human Preferences

cs.CL · 2019-09-18 · unverdicted · novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

Deep Learning Scaling is Predictable, Empirically

cs.LG · 2017-12-01 · unverdicted · novelty 7.0

Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

SVD on the lm_head weight matrix of transformers reveals interpretable vocabulary clusters that indicate training data composition, model differences, and ethical concerns in models like GPT-OSS, Gemma, and Qwen.

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

cs.CL · 2026-05-15 · conditional · novelty 6.0

A RoBERTa classifier trained on LLM-generated manner/result verb annotations from extended VerbNet data reaches up to 89.6% accuracy on held-out gold-standard sets.

BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

BabelDOC uses an intermediate representation to decouple layout from content for improved layout-preserving PDF translation.

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.

VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection

cs.CR · 2026-04-29 · unverdicted · novelty 6.0

VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.

Informativity of Data-Knowledge Pairs for Lyapunov Equations

eess.SY · 2026-04-20 · unverdicted · novelty 6.0

The paper derives an algebraic condition for when data-knowledge pairs are jointly informative enough to uniquely solve Lyapunov equations.

Multi Language Models for On-the-Fly Syntax Highlighting

cs.SE · 2025-10-05 · unverdicted · novelty 6.0

Unified multi-language deep learning model for on-the-fly syntax highlighting using normalization and few-shot learning to support six languages with lower deployment cost.

FAST: Efficient Action Tokenization for Vision-Language-Action Models

cs.RO · 2025-01-16 · unverdicted · novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.

MVIGER: Multi-View Variational Integration of Complementary Knowledge for Generative Recommender

cs.IR · 2024-08-16 · unverdicted · novelty 6.0

MVIGER integrates complementary knowledge from diverse prompts and indices in generative recommenders via a variational model with learnable prior over latent sources, showing superior performance on three datasets.

citing papers explorer

Showing 49 of 49 citing papers.

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization cs.LG · 2026-05-13 · unverdicted · none · ref 31 · internal anchor
Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models cs.LG · 2026-05-11 · unverdicted · none · ref 34 · internal anchor
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them cs.CL · 2026-04-18 · unverdicted · none · ref 40 · internal anchor
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts cs.CL · 2026-04-13 · unverdicted · none · ref 23 · internal anchor
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings q-bio.QM · 2026-04-09 · unverdicted · none · ref 53 · internal anchor
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions cs.CL · 2026-03-29 · unverdicted · none · ref 5 · internal anchor
Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice the deviation and strong temperature sensitivity.
Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts cs.LG · 2025-12-18 · unverdicted · none · ref 5 · internal anchor
The first HTR pipeline for Old Nepali manuscripts achieves 4.9% character error rate with released training code and scripts for low-resource historical scripts.
TokenChain: A Discrete Speech Chain via Semantic Token Modeling eess.AS · 2025-10-07 · unverdicted · none · ref 22 · internal anchor
TokenChain demonstrates that a discrete semantic-token interface can sustain effective chain learning between ASR and TTS, yielding faster convergence and lower error rates on LibriSpeech and TED-LIUM.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts cs.LG · 2024-08-28 · conditional · none · ref 6 · internal anchor
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
Chronos: Learning the Language of Time Series cs.LG · 2024-03-12 · conditional · none · ref 77 · internal anchor
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale cs.LG · 2022-08-15 · conditional · none · ref 95 · internal anchor
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 208 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 64 · internal anchor
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
Fine-Tuning Language Models from Human Preferences cs.CL · 2019-09-18 · unverdicted · none · ref 26 · internal anchor
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Deep Learning Scaling is Predictable, Empirically cs.LG · 2017-12-01 · unverdicted · none · ref 8 · internal anchor
Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have) cs.LG · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
SVD on the lm_head weight matrix of transformers reveals interpretable vocabulary clusters that indicate training data composition, model differences, and ethical concerns in models like GPT-OSS, Gemma, and Qwen.
A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research cs.CL · 2026-05-15 · conditional · none · ref 57 · internal anchor
A RoBERTa classifier trained on LLM-generated manner/result verb annotations from extended VerbNet data reaches up to 89.6% accuracy on held-out gold-standard sets.
BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation cs.CV · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
BabelDOC uses an intermediate representation to decouple layout from content for improved layout-preserving PDF translation.
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs cs.CL · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection cs.CR · 2026-04-29 · unverdicted · none · ref 33 · internal anchor
VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
Informativity of Data-Knowledge Pairs for Lyapunov Equations eess.SY · 2026-04-20 · unverdicted · none · ref 16 · internal anchor
The paper derives an algebraic condition for when data-knowledge pairs are jointly informative enough to uniquely solve Lyapunov equations.
Multi Language Models for On-the-Fly Syntax Highlighting cs.SE · 2025-10-05 · unverdicted · none · ref 15 · internal anchor
Unified multi-language deep learning model for on-the-fly syntax highlighting using normalization and few-shot learning to support six languages with lower deployment cost.
FAST: Efficient Action Tokenization for Vision-Language-Action Models cs.RO · 2025-01-16 · unverdicted · none · ref 57 · internal anchor
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
MVIGER: Multi-View Variational Integration of Complementary Knowledge for Generative Recommender cs.IR · 2024-08-16 · unverdicted · none · ref 25 · internal anchor
MVIGER integrates complementary knowledge from diverse prompts and indices in generative recommenders via a variational model with learnable prior over latent sources, showing superior performance on three datasets.
Efficient Training of Language Models to Fill in the Middle cs.CL · 2022-07-28 · unverdicted · none · ref 64 · internal anchor
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 260 · internal anchor
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation cs.CV · 2022-06-22 · unverdicted · none · ref 32 · internal anchor
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
CoCa: Contrastive Captioners are Image-Text Foundation Models cs.CV · 2022-05-04 · accept · none · ref 43 · internal anchor
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
LaMDA: Language Models for Dialog Applications cs.CL · 2022-01-20 · unverdicted · none · ref 91 · internal anchor
LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 247 · internal anchor
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation cs.SE · 2021-02-09 · unverdicted · none · ref 69 · internal anchor
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 141 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
CTRL: A Conditional Transformer Language Model for Controllable Generation cs.CL · 2019-09-11 · unverdicted · none · ref 41 · internal anchor
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks cs.CL · 2019-07-25 · unverdicted · none · ref 12 · internal anchor
DropAttention regularizes attention weights in fully-connected self-attention networks to reduce overfitting and improve performance.
Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts cs.CL · 2019-06-28 · conditional · none · ref 25 · internal anchor
Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.
Informative Image Captioning with External Sources of Information cs.CL · 2019-06-20 · unverdicted · none · ref 19 · internal anchor
A multimodal Transformer ingests image features plus multiple external entity label sources and learns to control their appearance in fluent output captions.
In Search of Lost DNA Sequence Pretraining cs.LG · 2026-04-17 · unverdicted · none · ref 31 · internal anchor
DNA pretraining suffers from inappropriate evaluation datasets, flawed neighbor-masking, and neglected vocabulary design; the authors supply guidelines and a reproducible testbed to fix them.
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings cs.SI · 2026-04-08 · unverdicted · none · ref 37 · internal anchor
LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.
Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate cs.CL · 2026-04-06 · conditional · none · ref 12 · internal anchor
Empirical tests on software engineering tasks show Chinese prompts do not reduce token usage or improve success rates over English for LLM coding.
WorldVLA: Towards Autoregressive Action World Model cs.RO · 2025-06-26 · unverdicted · none · ref 22 · internal anchor
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search cs.CV · 2024-05-13 · unverdicted · none · ref 5 · internal anchor
DAPL integrates positive and negative descriptions using Dual Image-Attribute Contrastive learning, Sensitive Image-Attribute Matching, and Dynamic Token-wise Similarity loss to improve text-based person search accuracy.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence cs.SE · 2024-01-25 · unverdicted · none · ref 15 · internal anchor
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
Galactica: A Large Language Model for Science cs.CL · 2022-11-16 · unverdicted · none · ref 232 · internal anchor
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Root Mean Square Layer Normalization cs.LG · 2019-10-16 · conditional · none · ref 24 · internal anchor
RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.
Agglomerative Attention cs.LG · 2019-07-15 · unverdicted · none · ref 32 · internal anchor
Presents agglomerative attention, a linear-complexity attention model that achieves comparable performance to full attention on language modeling tasks.
Attention Is All You Need cs.CL · 2017-06-12 · unverdicted · none · ref 31 · internal anchor
Pith review generated a malformed one-line summary.
ClinQueryAgent: A Conversational Agent for Population Health Management cs.IR · 2026-04-13 · unverdicted · none · ref 169 · internal anchor
The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.
Robust Machine Translation with Domain Sensitive Pseudo-Sources: Baidu-OSU WMT19 MT Robustness Shared Task System Report cs.CL · 2019-06-19 · unverdicted · none · ref 20 · internal anchor
Baidu-OSU WMT19 system achieves >10 BLEU gain on En-Fr and Fr-En social media translation via domain sensitive training and pseudo noisy sources.
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan cs.CL · 2026-05-09 · unreviewed · ref 23 · internal anchor

Neural Machine Translation of Rare Words with Subword Units

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer