pith. machine review for the scientific record.

Self-attention with relative position representations

10 Pith papers cite this work. Polarity classification is still indexing.
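The indexed paper's core idea, adding learned relative-position representations to self-attention, can be sketched as follows. This is a minimal single-head illustration, not the authors' implementation; all names, shapes, and the clipping scheme are assumptions for the sketch.

```python
import numpy as np

def relative_position_attention(x, w_q, w_k, w_v, a_k, a_v, max_dist):
    """Single-head self-attention with clipped relative-position embeddings.
    Shapes: x (n, d); w_q/w_k/w_v (d, d); a_k/a_v (2*max_dist+1, d)."""
    n, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Clip pairwise offsets j - i to [-max_dist, max_dist], then shift
    # them to non-negative indices into the embedding tables a_k / a_v.
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]
    idx = np.clip(offsets, -max_dist, max_dist) + max_dist
    rel_k = a_k[idx]  # (n, n, d): relative-position keys
    rel_v = a_v[idx]  # (n, n, d): relative-position values
    # Logits combine a content-content term with a content-position term.
    logits = (q @ k.T + np.einsum('id,ijd->ij', q, rel_k)) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output mixes content values and relative-position value vectors.
    return weights @ v + np.einsum('ij,ijd->id', weights, rel_v)
```

With `max_dist` small, distances beyond the clip window share one embedding, which is what lets the mechanism generalize to sequence lengths unseen in training.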


hub tools

citation-role summary
background 1

citation-polarity summary
background 1

representative citing papers

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

Video Diffusion Models

cs.CV · 2022-04-07 · unverdicted · novelty 7.0

A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering the first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.

Deep Wave Network for Modeling Multi-Scale Physical Dynamics

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

DW-Net improves the accuracy-versus-computational-cost Pareto front over standard U-Nets on 2D and 3D multi-scale flow benchmarks by stacking multiple waves while keeping training settings identical.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

Large Language Models: A Survey

cs.CL · 2024-02-09 · accept · novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

citing papers explorer

Showing 10 of 10 citing papers.

  • Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings q-bio.QM · 2026-04-09 · unverdicted · none · ref 23

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

  • Video Diffusion Models cs.CV · 2022-04-07 · unverdicted · none · ref 45

    A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering the first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.

  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 66

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification, and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

  • HEXST: Hexagonal Shifted-Window Transformer for Spatial Transcriptomics Gene Expression Prediction cs.LG · 2026-05-06 · unverdicted · none · ref 20

    HEXST applies a hexagonal shifted-window Transformer with rotary positional encodings, contrast-sensitive training objectives, and single-cell priors to predict gene expression from histology slides, outperforming prior models on seven datasets while preserving spatial heterogeneity.

  • Deep Wave Network for Modeling Multi-Scale Physical Dynamics cs.LG · 2026-05-05 · unverdicted · none · ref 65

    DW-Net improves the accuracy-versus-computational-cost Pareto front over standard U-Nets on 2D and 3D multi-scale flow benchmarks by stacking multiple waves while keeping training settings identical.

  • RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation cs.CV · 2026-04-30 · unverdicted · none · ref 68

    RIHA proposes a hierarchical alignment transformer that uses multi-scale visual and textual feature pyramids plus optimal transport to generate more accurate radiology reports from medical images.

  • Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 89

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while reducing KV-cache size by 75% and speeding up decoding by up to 6x.

  • Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity cs.CL · 2026-04-22 · unverdicted · none · ref 38

    Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.

  • A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 237

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

  • Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 126

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.