BioBERT, ClinicalBERT, PubMedBERT: pre-trained transformers for the biomedical domain

A generation of BERT models pre-trained on biomedical corpora — BioBERT (Korea University, 2019), ClinicalBERT (Harvard/NYU, 2019), PubMedBERT (Microsoft, 2020), BlueBERT (NIH, 2019), SciBERT (Allen AI, 2019), GatorTron (University of Florida, 2022) — and the Hugging Face ecosystem that distributes them.


After BERT, adaptation to the medical domain

BERT (Bidirectional Encoder Representations from Transformers) — published in October 2018 by Devlin et al. (Google) — redefined the state of the art in NLP. The approach: a Transformer network is pre-trained on large amounts of generic text (Wikipedia + BookCorpus) with self-supervised tasks (masked language modelling, next sentence prediction), then fine-tuned on specific downstream tasks (sentiment analysis, question answering, NER).
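The masked language modelling objective is essentially a data-preparation recipe: roughly 15% of token positions are selected for prediction; of those, 80% become [MASK], 10% a random token, and 10% stay unchanged. A minimal sketch in plain Python (the probabilities follow the original BERT recipe; the helper name and example sentence are ours):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of positions; of those, replace
    80% with [MASK], 10% with a random vocabulary token, keep 10%
    unchanged. Returns the corrupted sequence and the prediction targets."""
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok              # model must predict the original here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: token left unchanged, but still a prediction target
    return corrupted, targets

tokens = "the patient was treated with aspirin for chest pain".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens, rng=random.Random(42))
print(corrupted)
```

The model is then trained to recover the original token at each target position, which is what forces the encoder to build contextual representations.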

For the biomedical NLP community, BERT opens a concrete possibility: pre-train a BERT specifically on biomedical corpora instead of generic text, to better capture technical vocabulary, medical abbreviations, clinical note syntax, drug and disease names. Subsequent fine-tuning on specific biomedical tasks — clinical entity NER, relation extraction, document classification — should produce superior results.

This insight generated a family of models, all available as open source through the Hugging Face ecosystem and the authors’ repositories.

BioBERT

BioBERT — Bidirectional Encoder Representations from Transformers for Biomedical Text Mining — was published by the team of Jinhyuk Lee, Wonjin Yoon, Sungdong Kim and other authors from Korea University and Naver/Clova AI. The preprint dates from 2019; the full paper appeared in Bioinformatics in 2020.

The model:

  • Starts from BERT-base (110 million parameters, 12 layers, 768 hidden dim)
  • Continues pre-training on PubMed abstracts (4.5 billion words) and PMC full-text articles (13.5 billion words)
  • Releases several variants: BioBERT-Base v1.0, v1.1; a “cased” variant preserving capitalisation (relevant in biomedicine where “CD4” ≠ “cd4”)

Pre-training reuses the original BERT vocabulary — problematic for rare biomedical terms, which are often fragmented into many sub-tokens. In comparative evaluations BioBERT improves over standard BERT by ~3-5 F1 points on biomedical NER tasks (diseases, chemicals, genes), relation extraction and QA (BioASQ).
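The fragmentation issue is easy to demonstrate with a toy greedy longest-match tokenizer in the WordPiece style used by BERT vocabularies (the vocabulary below is invented for illustration; real BERT vocabularies hold ~30,000 pieces):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first sub-word tokenisation (WordPiece-style).
    Continuation pieces carry the '##' prefix, as in BERT vocabularies."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

# A general-domain vocabulary has no entry for the full drug name,
# so it fragments into several sub-tokens; a domain vocabulary would not.
general_vocab = {"pac", "##li", "##tax", "##el", "aspirin"}
print(wordpiece("paclitaxel", general_vocab))  # ['pac', '##li', '##tax', '##el']
print(wordpiece("aspirin", general_vocab))     # ['aspirin']
```

A term split into four pieces forces the model to reassemble its meaning from fragments — the motivation for the domain-built vocabulary that PubMedBERT later adopts.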

The licence is Apache 2.0; weights are distributed freely via GitHub and the Hugging Face Hub.

ClinicalBERT

The distinction between biomedical (scientific literature) and clinical (record notes, discharge letters) matters: clinical language is more telegraphic, rich in local abbreviations, with distinctive syntactic structures. A biomedical BERT pre-trained on PubMed is not optimal for clinical text.

Two papers titled ClinicalBERT appear in 2019:

  • Alsentzer et al. (2019) — Harvard Medical School and MIT — starts from BioBERT and continues pre-training on MIMIC-III (public ICU note dataset from Beth Israel Deaconess Medical Center). The resulting model is particularly strong on MIMIC tasks (i2b2, n2c2 challenges)
  • Huang et al. (2019) — NYU — starts directly from BERT and pre-trains on clinical notes, focused on hospital readmission prediction

Both distributed under MIT licence and published on Hugging Face.

PubMedBERT

A more recent variant — published by Microsoft Research, Yu Gu et al. (2020), “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing” — changes the strategic approach:

  • Pre-training from scratch on PubMed (does not start from generic BERT)
  • Vocabulary built directly on the PubMed corpus (better adapted to medical terms)
  • On the BLURB benchmark (Biomedical Language Understanding and Reasoning Benchmark, also introduced in the paper), PubMedBERT surpasses BioBERT

PubMedBERT is distributed under the MIT licence, in base and large variants.

SciBERT, BlueBERT, SapBERT and others

The ecosystem has diversified rapidly:

  • SciBERT (Beltagy et al. 2019, Allen AI) — BERT pre-trained on multidisciplinary scientific texts (computer science + biomedicine)
  • BlueBERT (Peng et al. 2019, NIH) — BERT pre-trained on PubMed + MIMIC
  • SapBERT (Liu et al. 2020-2021, Cambridge) — specialised for medical entity linking, excellent at mapping mentions to UMLS concepts
  • BioLinkBERT (Yasunaga et al. 2022, Stanford) — pre-training enriched with citation links
  • Med-BERT (Rasmy et al. 2021) — BERT trained on ICD diagnostic codes in clinical records (not free text, but code sequences)
  • RadBERT, PathologyBERT, BERTweet for Health — specialist variants
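At inference time, SapBERT-style entity linking reduces to nearest-neighbour search in embedding space: encode the mention, compare it against pre-computed concept embeddings, return the closest concept. A sketch with toy vectors standing in for real SapBERT embeddings (the CUI labels and vectors are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings standing in for SapBERT vectors, keyed by UMLS concept
# (labels below are illustrative placeholders, not verified CUIs).
concept_embeddings = {
    "C0004057 (aspirin)":      [0.9, 0.1, 0.0],
    "C0027051 (heart attack)": [0.1, 0.8, 0.3],
}

def link(mention_vec, concepts):
    """Return the concept whose embedding is closest to the mention."""
    return max(concepts, key=lambda c: cosine(mention_vec, concepts[c]))

print(link([0.85, 0.15, 0.05], concept_embeddings))  # → C0004057 (aspirin)
```

Real systems replace the toy vectors with SapBERT encodings of millions of UMLS synonym strings and use approximate nearest-neighbour indexes for speed.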

GatorTron

A jump in scale arrives with GatorTron, published in March 2022 by Yonghui Wu’s group at the University of Florida in collaboration with NVIDIA. GatorTron is pre-trained on 90 billion words of clinical text (UF Health records + MIMIC + literature), with 345M, 3.9B and 8.9B parameter variants — significantly larger than the traditional BERTs (~100-300M parameters).

GatorTron shows that scale produces significant improvements across clinical NLP tasks. Models up to 3.9B have been released publicly; the 8.9B variant has more controlled distribution.

Hugging Face Transformers as infrastructure

The ecosystem of these models is made accessible by Hugging Face Transformers — an open source Python library launched in 2018, today the standard platform for using and distributing transformer models. With three lines of code a researcher can load BioBERT, tokenise clinical text, and extract embeddings or run inference:

from transformers import AutoTokenizer, AutoModel

# Download BioBERT v1.1 weights and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")
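Once the model runs, a common way to turn per-token embeddings into a single sentence vector is mean pooling of the last hidden state over non-padding positions. The sketch below shows just the pooling step with dummy tensors, so it runs without downloading any weights (in real use, last_hidden_state and attention_mask come from the model and tokenizer calls above):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Dummy stand-ins for model(**inputs).last_hidden_state and the
# tokenizer's attention_mask (BERT-base hidden size is 768).
hidden = torch.randn(2, 5, 768)          # batch of 2, 5 tokens each
mask = torch.tensor([[1, 1, 1, 0, 0],    # 3 real tokens, 2 padding
                     [1, 1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 768])
```

The resulting vectors can feed a classifier, a clustering step, or a similarity search over clinical notes.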

The Hugging Face Hub hosts thousands of community-contributed biomedical models, with documentation cards, performance metrics, usage examples.

Applicative tasks

Biomedical transformers have advanced state of the art on many clinically relevant tasks:

  • Clinical Named Entity Recognition — identifying drugs, diseases, symptoms, procedures in clinical notes, surpassing cTAKES and rule-based tools on benchmark datasets (i2b2, NCBI Disease, BC5CDR)
  • Relation Extraction — linking entities (drug causes side-effect, gene regulates disease)
  • Clinical concept normalisation — mapping mentions to UMLS/SNOMED CT codes
  • Document classification — diagnostic code prediction, ICU mortality prediction, readmission prediction
  • Biomedical Question Answering — BioASQ, PubMedQA
  • Text summarisation of clinical documents or scientific abstracts
  • Clinical note section segmentation — automatic identification of sections (history, physical exam, plan)
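Several of the tasks above — clinical NER in particular — share the same final post-processing step: decoding token-level BIO tags emitted by the classifier into entity spans. A minimal decoder (tag names are illustrative, in the style of the i2b2 problem/treatment schema):

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (entity_type, text) spans.
    'B-X' opens an entity of type X, 'I-X' continues it, 'O' is outside."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((etype, " ".join(current)))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(tok)
        else:
            if current:
                spans.append((etype, " ".join(current)))
            current, etype = [], None
    if current:
        spans.append((etype, " ".join(current)))
    return spans

tokens = ["patient", "denies", "chest", "pain", ",", "takes", "aspirin"]
tags   = ["O", "O", "B-PROBLEM", "I-PROBLEM", "O", "O", "B-TREATMENT"]
print(bio_to_spans(tokens, tags))
# [('PROBLEM', 'chest pain'), ('TREATMENT', 'aspirin')]
```

The spans can then be normalised to UMLS/SNOMED CT codes in the concept-normalisation step listed above.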

Limits

Biomedical transformer models, as of 2022, have important limitations:

  • Language — nearly all models are trained on English text. For Italian, German, French or Spanish, adoption requires language-specific models (emerging in subsequent years; the Italian BioBIT is a first attempt)
  • Context length — limited to 512 tokens in base variants, too short for a whole clinical document. Extensions like Longformer and BigBird partially resolve this
  • Limited reasoning — “classic” transformers produce embeddings and classifications but reason shallowly; tasks requiring complex inferential chains remain challenging
  • Knowledge update — a model pre-trained in 2019 does not know later discoveries (e.g. drugs approved after training); periodic retraining is costly
  • Bias — models reflect biases in their training data (demographics, under-represented conditions)
  • Regulation — using BioBERT in a certified clinical product requires IEC 62304 qualification, risk management and clinical validation
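The 512-token limit is commonly worked around by splitting a long note into overlapping windows, running the model on each, and aggregating the predictions. A sketch of the windowing step alone, on token ids (window and stride values are illustrative):

```python
def sliding_windows(token_ids, max_len=512, stride=384):
    """Split a token-id sequence into windows of at most max_len tokens.
    Consecutive windows start `stride` tokens apart, so they overlap by
    (max_len - stride) tokens to avoid cutting entities at boundaries."""
    if len(token_ids) <= max_len:
        return [token_ids]
    windows, start = [], 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows

doc = list(range(1200))  # stand-in for a tokenised clinical note
wins = sliding_windows(doc, max_len=512, stride=384)
print(len(wins), [len(w) for w in wins])  # 3 [512, 512, 432]
```

Predictions in the overlap region can then be merged, e.g. by keeping the label from the window where the token sits farther from the edge.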

The Italian landscape

As of 2022, the use of BioBERT/ClinicalBERT in Italy is mostly confined to research and applied research in collaboration with hospitals:

  • Information extraction from clinical records in research governance projects
  • De-identification pipelines for clinical notes
  • Coding support (ICD-9-CM suggestion on discharge letters)
  • Information retrieval over Italian biomedical literature

The open node is availability of Italian models pre-trained on Italian clinical text. BioBIT (RWTH Aachen, 2022) is a first reference; autonomous Italian projects are emerging.

Outlook

As of June 2022, several trends are observable:

  • Scaling up — ever larger models (GatorTron 9B, and beyond) whose performance improves with scale
  • Multilingual medical transformers — cross-lingual models useful for non-English markets are expected to emerge
  • Integration with knowledge bases — BERT + knowledge graphs (SNOMED CT, UMLS) for structured reasoning
  • Emergence of generative large language models — GPT-3 (2020), PaLM (2022) and their successors are starting to be explored for medical use, and will likely reshape the landscape over the next few years
  • HIPAA/GDPR compliance — on-premise deployment patterns with pre-trained models to protect clinical data
  • RAG (Retrieval-Augmented Generation) — combining pre-trained models with external document retrieval to answer factual questions with cited sources

BioBERT and the encoder-only transformer biomedical generation are today standard baselines in clinical NLP. The next wave — generative decoder-only and encoder-decoder LLMs — is preparing to redefine applicative possibilities in healthcare again.


References: Lee, Yoon, Kim et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”, Bioinformatics (2020). Alsentzer et al., “Publicly Available Clinical BERT Embeddings” (2019). Huang et al., “ClinicalBERT” (2019). Gu et al., “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing” (2020, PubMedBERT). Beltagy et al., SciBERT (2019). Peng et al., BlueBERT (2019). Yang et al., GatorTron (2022). Hugging Face Transformers. Apache 2.0 / MIT licences.
