Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

· 2024 · cs.CL · arXiv 2412.13663

23 Pith papers cite this work. Polarity classification is still indexing.

abstract

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Is She Even Relevant? When BERT Ignores Explicit Gender Cues

cs.CL · 2026-05-08 · conditional · novelty 7.0

A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.

ProteinJEPA: Latent prediction complements protein language models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.

HyperTransport: Amortized Conditioning of T2I Generative Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

HyperTransport amortizes activation steering for T2I models via a hypernetwork that predicts intervention parameters from CLIP embeddings, delivering 3600-7000x speedup and matching per-concept baselines on 167 unseen concepts.

NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus

cs.CL · 2026-04-30 · unverdicted · novelty 7.0

NorBERTo, a ModernBERT encoder trained on the largest open Portuguese corpus of 331B tokens, reports top encoder results on several PLUE and ASSIN 2 tasks.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

RetroMotion: Retrocausal Motion Forecasting Models are Instructable

cs.CV · 2025-05-26 · unverdicted · novelty 7.0

Retrocausal transformer decomposes multi-agent motion forecasts into marginals and pairwise joints, models uncertainty with compressed exponentials, achieves strong Waymo results, generalizes to Argoverse 2 and V2X-Seq, and enables implicit instruction following from standard training.

SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

SAVER proposes a conformal groundability gate plus submodular image selector that activates vision only when needed for multimodal named entity recognition and relation extraction, improving F1 while lowering compute.

HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

cs.CL · 2026-05-16 · unverdicted · novelty 6.0

HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up to 72.5 percent cost savings on coding benchmarks while remaining decoupled from specific ...

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.

Rag Performance Prediction for Question Answering

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.

Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

cs.CL · 2025-12-11 · unverdicted · novelty 6.0

Explanation biases in feature attribution methods are systematic products of lexical and positional preferences, with observed trade-offs across models and higher bias in anomalous explanations.

Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

cs.CV · 2025-08-31 · unverdicted · novelty 6.0

PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.

Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records

cs.LG · 2025-07-28 · unverdicted · novelty 6.0

AACE is an annotation-assisted method for causal policy learning from multimodal EHRs that outperforms risk-based and representation-based baselines on synthetic, semi-synthetic, and real datasets.

Should We Still Pretrain Encoders with Masked Language Modeling?

cs.CL · 2025-07-01 · accept · novelty 6.0

Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

cs.CL · 2026-05-16 · conditional · novelty 5.0

Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.

Efficient Listwise Reranking with Compressed Document Representations

cs.IR · 2026-04-29 · unverdicted · novelty 5.0

RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.

Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

cs.CL · 2026-04-21 · unverdicted · novelty 5.0

Augmenting commonsense knowledge corpora with negation produces over 2M new triples that benefit LLM negation understanding when used for pre-training.

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

cs.CL · 2026-05-19 · unverdicted · novelty 4.0

m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.

Filter-then-Verify: A Multiphase GNN and ModernBERT Framework for Social Engineering Detection in Email Networks

cs.CR · 2026-05-17 · unverdicted · novelty 4.0

A two-stage GNN-plus-ModernBERT framework detects social engineering attacks in email networks by first filtering structural anomalies at 86% recall and then verifying content to reach over 92% precision on augmented Enron data.

Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters

cs.CL · 2026-03-11 · unverdicted · novelty 4.0

Zero-shot GPT-OSS detects depression from 1,108 primary care encounter transcripts with AUPRC 0.51 and AUROC 0.77, with meaningful signals in the first 128 patient tokens and added value from dyadic mirroring.

A Unified Framework for Modeling Heterogeneous Financial Data via Dual-Granularity Prompting

cs.CE · 2024-04-19 · unverdicted · novelty 4.0

FinLangNet applies dual-granularity prompting in a sequential model to heterogeneous financial data, reporting 6.3 pp KS improvement and 9.9% bad debt reduction in real-world deployment.

citing papers explorer

Showing 23 of 23 citing papers.

Who Owns This Agent? Tracing AI Agents Back to Their Owners cs.CR · 2026-05-15 · unverdicted · none · ref 35 · internal anchor
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
Is She Even Relevant? When BERT Ignores Explicit Gender Cues cs.CL · 2026-05-08 · conditional · none · ref 7 · internal anchor
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
ProteinJEPA: Latent prediction complements protein language models cs.LG · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
HyperTransport: Amortized Conditioning of T2I Generative Models cs.LG · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
HyperTransport amortizes activation steering for T2I models via a hypernetwork that predicts intervention parameters from CLIP embeddings, delivering 3600-7000x speedup and matching per-concept baselines on 167 unseen concepts.
NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus cs.CL · 2026-04-30 · unverdicted · none · ref 4 · internal anchor
NorBERTo, a ModernBERT encoder trained on the largest open Portuguese corpus of 331B tokens, reports top encoder results on several PLUE and ASSIN 2 tasks.
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings q-bio.QM · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
RetroMotion: Retrocausal Motion Forecasting Models are Instructable cs.CV · 2025-05-26 · unverdicted · none · ref 51 · internal anchor
Retrocausal transformer decomposes multi-agent motion forecasts into marginals and pairwise joints, models uncertainty with compressed exponentials, achieves strong Waymo results, generalizes to Argoverse 2 and V2X-Seq, and enables implicit instruction following from standard training.
SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction cs.CV · 2026-05-20 · unverdicted · none · ref 21 · internal anchor
SAVER proposes a conformal groundability gate plus submodular image selector that activates vision only when needed for multimodal named entity recognition and relation extraction, improving F1 while lowering compute.
HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools cs.CL · 2026-05-16 · unverdicted · none · ref 17 · internal anchor
HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up to 72.5 percent cost savings on coding benchmarks while remaining decoupled from specific ...
GLiGuard: Schema-Conditioned Classification for LLM Safeguard cs.CL · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation cs.LG · 2026-04-26 · unverdicted · none · ref 12 · internal anchor
Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.
Rag Performance Prediction for Question Answering cs.CL · 2026-04-09 · unverdicted · none · ref 41 · internal anchor
A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.
Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution cs.CL · 2025-12-11 · unverdicted · none · ref 5 · internal anchor
Explanation biases in feature attribution methods are systematic products of lexical and positional preferences, with observed trade-offs across models and higher bias in anomalous explanations.
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering cs.CV · 2025-08-31 · unverdicted · none · ref 45 · internal anchor
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records cs.LG · 2025-07-28 · unverdicted · none · ref 42 · internal anchor
AACE is an annotation-assisted method for causal policy learning from multimodal EHRs that outperforms risk-based and representation-based baselines on synthetic, semi-synthetic, and real datasets.
Should We Still Pretrain Encoders with Masked Language Modeling? cs.CL · 2025-07-01 · accept · none · ref 43 · internal anchor
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.
Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning cs.CL · 2026-05-16 · conditional · none · ref 182 · internal anchor
Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.
Efficient Listwise Reranking with Compressed Document Representations cs.IR · 2026-04-29 · unverdicted · none · ref 34 · internal anchor
RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.
Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding cs.CL · 2026-04-21 · unverdicted · none · ref 7 · internal anchor
Augmenting commonsense knowledge corpora with negation produces over 2M new triples that benefit LLM negation understanding when used for pre-training.
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder cs.CL · 2026-05-19 · unverdicted · none · ref 44 · internal anchor
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
Filter-then-Verify: A Multiphase GNN and ModernBERT Framework for Social Engineering Detection in Email Networks cs.CR · 2026-05-17 · unverdicted · none · ref 15 · internal anchor
A two-stage GNN-plus-ModernBERT framework detects social engineering attacks in email networks by first filtering structural anomalies at 86% recall and then verifying content to reach over 92% precision on augmented Enron data.
Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters cs.CL · 2026-03-11 · unverdicted · none · ref 33 · internal anchor
Zero-shot GPT-OSS detects depression from 1,108 primary care encounter transcripts with AUPRC 0.51 and AUROC 0.77, with meaningful signals in the first 128 patient tokens and added value from dyadic mirroring.
A Unified Framework for Modeling Heterogeneous Financial Data via Dual-Granularity Prompting cs.CE · 2024-04-19 · unverdicted · none · ref 51 · internal anchor
FinLangNet applies dual-granularity prompting in a sequential model to heterogeneous financial data, reporting 6.3 pp KS improvement and 9.9% bad debt reduction in real-world deployment.
