Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.
hub Mixed citations
Neural Machine Translation of Rare Words with Subword Units
Mixed citation behavior. Most common role is method (67%).
abstract
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Incremental k-center clustering admits no better than 2-approximation even for non-polynomial algorithms, via a new lower-bound construction.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice the deviation and strong temperature sensitivity.
The first HTR pipeline for Old Nepali manuscripts achieves 4.9% character error rate with released training code and scripts for low-resource historical scripts.
TokenChain demonstrates that a discrete semantic-token interface can sustain effective chain learning between ASR and TTS, yielding faster convergence and lower error rates on LibriSpeech and TED-LIUM.
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.
SVD on the lm_head weight matrix of transformers reveals interpretable vocabulary clusters that indicate training data composition, model differences, and ethical concerns in models like GPT-OSS, Gemma, and Qwen.
A RoBERTa classifier trained on LLM-generated manner/result verb annotations from extended VerbNet data reaches up to 89.6% accuracy on held-out gold-standard sets.
BabelDOC uses an intermediate representation to decouple layout from content for improved layout-preserving PDF translation.
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
The paper derives an algebraic condition for when data-knowledge pairs are jointly informative enough to uniquely solve Lyapunov equations.
Unified multi-language deep learning model for on-the-fly syntax highlighting using normalization and few-shot learning to support six languages with lower deployment cost.
citing papers explorer
-
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
-
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.
-
The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions
Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice the deviation and strong temperature sensitivity.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Fine-Tuning Language Models from Human Preferences
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
-
TRADE: Transducer-Augmented Decoder for Speech LLM
TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.
-
A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research
A RoBERTa classifier trained on LLM-generated manner/result verb annotations from extended VerbNet data reaches up to 89.6% accuracy on held-out gold-standard sets.
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
-
Efficient Training of Language Models to Fill in the Middle
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
LaMDA: Language Models for Dialog Applications
LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
CTRL: A Conditional Transformer Language Model for Controllable Generation
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
-
DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks
DropAttention regularizes attention weights in fully-connected self-attention networks to reduce overfitting and improve performance.
-
Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.
-
Informative Image Captioning with External Sources of Information
A multimodal Transformer ingests image features plus multiple external entity label sources and learns to control their appearance in fluent output captions.
-
Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate
Empirical tests on software engineering tasks show Chinese prompts do not reduce token usage or improve success rates over English for LLM coding.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Attention Is All You Need
Pith review generated a malformed one-line summary.
-
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
-
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
Introduces an interpretable DL framework to analyze grammatical gender restructuring from Latin to Occitan, with a new tokenizer and quantification of morphological and POS contributions to gender prediction.
-
PortBERT: Navigating the Depths of Portuguese Language Models
PortBERT releases two RoBERTa models for Portuguese that match or beat prior monolingual and multilingual models on translated GLUE/SuperGLUE tasks while reporting training and inference times.
-
Robust Machine Translation with Domain Sensitive Pseudo-Sources: Baidu-OSU WMT19 MT Robustness Shared Task System Report
Baidu-OSU WMT19 system achieves >10 BLEU gain on En-Fr and Fr-En social media translation via domain sensitive training and pseudo noisy sources.