pith. machine review for the scientific record.

arxiv: 1810.04805 · v2 · submitted 2018-10-11 · 💻 cs.CL

Recognition: unknown

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Kenton Lee, Kristina Toutanova, Ming-Wei Chang

classification 💻 cs.CL
keywords bert, language, absolute, improvement, bidirectional, point, answering, deep

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
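
The phrase "jointly conditioning on both left and right context" can be illustrated with a small visibility sketch. The snippet below is a hypothetical illustration, not code from the paper: the function name `visible_positions` and the four-token example are assumptions introduced here. It shows only which positions each token may condition on in a left-to-right model versus a bidirectional one, and says nothing about the pre-training objective itself.

```python
def visible_positions(seq_len, bidirectional):
    """For each position i, list the positions it may condition on.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rows = []
    for i in range(seq_len):
        if bidirectional:
            # Bidirectional: every position sees the whole sequence.
            rows.append(list(range(seq_len)))
        else:
            # Left-to-right: position i sees only positions 0..i.
            rows.append(list(range(i + 1)))
    return rows


left_to_right = visible_positions(4, bidirectional=False)
both_sides = visible_positions(4, bidirectional=True)

for i in range(4):
    print(i, left_to_right[i], both_sides[i])
```

In the left-to-right case the token at position 1 can use only positions 0 and 1, while in the bidirectional case it can use all four positions, which is the distinction the abstract draws.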

This paper has not been read by Pith yet.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Online Learning-to-Defer with Varying Experts

    stat.ML 2026-05 unverdicted novelty 8.0

    Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

  3. Learning to Unscramble Feynman Loop Integrals with SAILIR

    hep-ph 2026-04 unverdicted novelty 8.0

    A self-supervised transformer learns to unscramble Feynman integrals for online IBP reduction, delivering bounded memory use on complex two-loop topologies while matching Kira's speed on the hardest cases tested.

  4. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  5. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  6. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  7. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

    cs.IR 2021-04 accept novelty 8.0

    BEIR is a heterogeneous zero-shot benchmark showing BM25 as a robust baseline while re-ranking and late-interaction models perform best on average at higher cost, with dense and sparse models lagging in generalization.

  8. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  9. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  10. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  11. Reformer: The Efficient Transformer

    cs.LG 2020-01 accept novelty 8.0

    Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

  12. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    cs.CL 2019-08 unverdicted novelty 8.0

    Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...

  13. HellaSwag: Can a Machine Really Finish Your Sentence?

    cs.CL 2019-05 unverdicted novelty 8.0

    HellaSwag dataset shows state-of-the-art models fail commonsense inference tasks that humans solve easily, built via adversarial filtering of distractors.

  14. Passage Re-ranking with BERT

    cs.IR 2019-01 unverdicted novelty 8.0

    Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.

  15. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

  16. Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    CoDAAR creates a unified discrete representation space for multimodal sequences by aligning modality-specific codebooks through index-level semantic consensus, enabling both specificity and cross-modal generalization.

  17. Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state...

  18. Enhancing Healthcare Search Intent Recognition with Query Representation Learning and Session Context

    cs.IR 2026-05 unverdicted novelty 7.0

    Clustering-based query representations with a novel multi-intent loss and a concordance rate metric improve healthcare search intent classification on two real-world log datasets.

  19. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  20. NeuralBench: A Unifying Framework to Benchmark NeuroAI Models

    cs.LG 2026-05 conditional novelty 7.0

    NeuralBench is a new benchmarking framework for neuroAI models on EEG data that finds foundation models only marginally outperform task-specific ones while many tasks like cognitive decoding stay highly challenging.

  21. Flexible Routing via Uncertainty Decomposition

    cs.LG 2026-05 unverdicted novelty 7.0

    A router that decomposes uncertainty to flexibly route queries between cheap models and oracles while providing regret bounds and supporting abstention in classification tasks with multiple annotations.

  22. Is She Even Relevant? When BERT Ignores Explicit Gender Cues

    cs.CL 2026-05 conditional novelty 7.0

    A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.

  23. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  24. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

    cs.LG 2026-05 conditional novelty 7.0

    Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...

  25. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  26. Conditional generation of antibody sequences with classifier-guided germline-absorbing discrete diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Germline-absorbing discrete diffusion uses the germline sequence as the absorbing state to reduce germline bias in antibody modeling, raising non-germline residue prediction accuracy from 26% to 46% and improving cond...

  27. TCRTransBench: A Comprehensive Benchmark for Bidirectional TCR-Peptide Sequence Generation

    q-bio.CB 2026-05 unverdicted novelty 7.0

    TCRTransBench provides a new benchmark with bidirectional TCR-peptide generation tasks, a large validated dataset, and metrics to evaluate neural models for immunological sequence modeling.

  28. Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties

    cs.CL 2026-05 unverdicted novelty 7.0

    A framework with TOPPing source selection and VACAI-Bowl dual-branch model yields 54.62% average improvement in dependency parsing across 10 low-resource varieties.

  29. Deep Graph-Language Fusion for Structure-Aware Code Generation

    cs.SE 2026-05 unverdicted novelty 7.0

    CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

  30. MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports

    cs.CL 2026-05 unverdicted novelty 7.0

    MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.

  31. Reconstructing conformal field theoretical compositions with Transformers

    hep-th 2026-05 unverdicted novelty 7.0

    Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.

  32. Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs

    cs.AI 2026-04 unverdicted novelty 7.0

    MEDS is a dataset of 28,000 LLM personas performing high-school math tasks alongside psychometric tests and cognitive networks that capture math anxiety, self-efficacy, and confidence to support safer AI tutors.

  33. Identifying and Characterizing Semantic Clones of Solidity Functions

    cs.SE 2026-04 unverdicted novelty 7.0

    A code-and-comment analysis method detects semantic clones in Solidity functions with 59% overall precision (84% for same-name functions) and 97% recall on 300k contracts, plus LLM summaries for uncommented code.

  34. OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniVTG creates a new large-scale open-world VTG dataset using iterative concept-gap filling and timestamped captioning, paired with a three-stage self-correction CoT paradigm that yields SOTA zero-shot results on fou...

  35. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  36. Provably Secure Steganography Based on List Decoding

    cs.CR 2026-04 conditional novelty 7.0

    List decoding enables a provably secure steganography scheme with higher embedding capacity for LLMs via candidate sets and suffix matching.

  37. Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

    hep-ph 2026-04 unverdicted novelty 7.0

    The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.

  38. Sparse Contrastive Learning for Content-Based Cold Item Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SEMCo uses sparse entmax contrastive learning for purely content-based cold-start item recommendation, outperforming standard methods in ranking accuracy.

  39. SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates

    cs.LG 2026-04 unverdicted novelty 7.0

    LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.

  40. LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset

    cs.CL 2026-04 unverdicted novelty 7.0

    LASQ is a new quadruple extraction dataset for Uzbek and Uyghur that includes a syntax-aware model showing gains over baselines on the task.

  41. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  42. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  43. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.

  44. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.

  45. Graph Topology Information Enhanced Heterogeneous Graph Representation Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    ToGRL learns high-quality graph structures from raw heterogeneous graphs via a two-stage topology extraction process and prompt tuning, outperforming prior methods on five datasets.

  46. Unlocking Prompt Infilling Capability for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

  47. Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

    cs.DC 2026-04 unverdicted novelty 7.0

    Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

  48. Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

    cs.CV 2026-04 unverdicted novelty 7.0

    Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garm...

  49. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

    cs.AR 2026-03 unverdicted novelty 7.0

    ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

  50. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  51. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  52. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    cs.CL 2024-02 unverdicted novelty 7.0

    M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...

  53. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    cs.CV 2024-01 conditional novelty 7.0

    Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

  54. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    cs.CL 2023-10 conditional novelty 7.0

    Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.

  55. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  56. Steering Language Models With Activation Engineering

    cs.CL 2023-08 unverdicted novelty 7.0

    Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

  57. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    cs.RO 2023-04 conditional novelty 7.0

    Low-cost imprecise robots achieve 80-90% success on six fine bimanual manipulation tasks using imitation learning with a new Action Chunking with Transformers algorithm trained on only 10 minutes of demonstrations.

  58. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  59. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  60. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    cs.LG 2022-11 conditional novelty 7.0

    PatchTST uses subseries patching and channel-independent Transformers to deliver significantly better long-term multivariate time series forecasting and strong self-supervised transfer performance.