MetaMap and UMLS: concept extraction from clinical and biomedical text

An overview of the National Library of Medicine's MetaMap: how it maps free text to UMLS Metathesaurus concepts, its Prolog/Java architecture, the wider NLM tool family, and its uses in coding, indexing and biomedical literature mining.


Clinical text as data

Much clinical information (history, physical examination, impression, treatment plan, course, discharge) is recorded as free text in the electronic clinical record. Structured components (demographics, coded diagnoses, medications, laboratory results) capture the quantifiable part; the narrative remains the natural way for a clinician to record diagnostic and therapeutic reasoning.

Extracting computable information from this text, that is, identifying the diseases, drugs, procedures, anatomy and symptoms it mentions, is the task of biomedical Natural Language Processing. The central problem is linking surface terms (as the clinician writes them) to the controlled concepts of a medical terminology. That link is what enables structured search, epidemiological analytics and clinical decision support.

Over the last twenty years the US National Library of Medicine (NLM) has built the infrastructure that makes this link possible: UMLS as the terminological resource, and MetaMap as the text-mapping tool.

UMLS

The Unified Medical Language System is an NLM project begun in the mid-1980s and continuously updated. UMLS is not a single terminology but a federated integration of over 150 source vocabularies (SNOMED CT, MeSH, ICD-9/10, RxNorm, LOINC, CPT, NDF-RT, GO, and dozens more) unified in a common structure. Releases are semi-annual (AA edition in April, AB in November); the current release at time of writing is UMLS 2009AA.

The main UMLS components:

  • Metathesaurus: over 2.5 million distinct concepts, each with a Concept Unique Identifier (CUI) of the form C0000039 (the letter C followed by seven digits). Each CUI groups synonymous terms from the various source vocabularies, in whichever languages those sources provide
  • Semantic Network: a higher-level layer of 135 Semantic Types (Disease or Syndrome, Pharmacologic Substance, Diagnostic Procedure, Anatomical Structure, …) with semantic relations among types (is-a, treats, causes)
  • Specialist Lexicon: a computational English lexicon with morphological and syntactic features
  • Tabular files: MRCONSO (concept-term mappings), MRSTY (concept-semantic type), MRREL (relations), MRHIER (hierarchies)
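To make the concept-term grouping concrete, the following is a minimal sketch of reading MRCONSO rows and collecting the synonyms attached to each CUI. It assumes the pipe-delimited Rich Release Format layout in which CUI, language, source vocabulary and term string sit at column positions 0, 1, 11 and 14; the sample rows are illustrative, with placeholder values in the identifier columns, and real data would come from a licensed UMLS installation.

```python
import re
from collections import defaultdict

# Assumed column positions in a pipe-delimited MRCONSO.RRF row.
CUI, LAT, SAB, STR = 0, 1, 11, 14
CUI_PATTERN = re.compile(r"^C\d{7}$")  # letter C followed by seven digits

# Illustrative rows only: LUI/SUI/AUI values are placeholders, not real identifiers.
SAMPLE_ROWS = [
    "C0000039|ENG|P|L0000001|PF|S0000001|Y|A0000001||||MSH|MH|D015060|1,2-Dipalmitoylphosphatidylcholine|0|N||",
    "C0000039|ENG|S|L0000002|PF|S0000002|Y|A0000002||||MSH|EN|D015060|Dipalmitoyllecithin|0|N||",
]


def synonyms_by_cui(rows, language="ENG"):
    """Group (source vocabulary, term string) pairs under their CUI."""
    groups = defaultdict(set)
    for row in rows:
        fields = row.rstrip("\n").split("|")
        # Skip rows that are too short or whose first field is not a CUI.
        if len(fields) <= STR or not CUI_PATTERN.match(fields[CUI]):
            continue
        if fields[LAT] == language:
            groups[fields[CUI]].add((fields[SAB], fields[STR]))
    return groups


if __name__ == "__main__":
    for cui, terms in synonyms_by_cui(SAMPLE_ROWS).items():
        for source, term in sorted(terms):
            print(cui, source, term)
```

On a real installation the same function could be fed an open file handle instead of the sample list, since it only iterates over lines.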

UMLS access is free but subject to the UMLS Metathesaurus License Agreement, which NLM grants upon registration. The licence carries obligations that depend on the source vocabulary: SNOMED CT and CPT, for example, have their own specific policies.

MetaMap

MetaMap was developed by Alan R. Aronson at NLM in the 1990s as a tool to automatically index MEDLINE articles with MeSH concepts. Its central function: given a biomedical text, identify the UMLS concepts it contains and return them as structured annotations with CUI, Semantic Type, position in the text and confidence score.

The MetaMap processing pipeline includes:

  1. Parsing — syntactic analysis with the SPECIALIST Parser, identification of candidate noun phrases
  2. Variant generation — for each phrase, generation of lexical variants (singular/plural, inflection, synonyms, expanded acronyms)
  3. Candidate retrieval — matching of variants against the UMLS Metathesaurus, collection of candidate concepts
  4. Candidate evaluation — scoring of candidates according to metrics of centrality, variation, cohesiveness, coverage
  5. Mapping construction — selection of the best mapping for each phrase, with any ambiguity resolved
  6. Output — annotated text representation with assigned CUIs, score, ambiguities
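The flow of steps 2 to 5 can be illustrated with a deliberately tiny sketch. This is not MetaMap's algorithm: the three-entry lexicon, the naive plural stripping and the coverage-only score are all simplifying assumptions, and the CUIs are shown for illustration and should be checked against the Metathesaurus.

```python
import re

# Hypothetical miniature lexicon: normalised term -> (CUI, preferred name).
TOY_LEXICON = {
    "heart attack": ("C0027051", "Myocardial Infarction"),
    "myocardial infarction": ("C0027051", "Myocardial Infarction"),
    "aspirin": ("C0004057", "Aspirin"),
}


def variants(phrase):
    """Variant generation: normalised form plus a naive singular form."""
    base = " ".join(re.findall(r"[a-z0-9]+", phrase.lower()))
    forms = {base}
    if base.endswith("s"):
        forms.add(base[:-1])
    return forms


def candidates(phrase):
    """Candidate retrieval and evaluation, scored by word coverage only."""
    forms = variants(phrase)
    n_words = max(len(form.split()) for form in forms)
    best = {}  # CUI -> (score, preferred name)
    for form in forms:
        for term, (cui, name) in TOY_LEXICON.items():
            if re.search(rf"\b{re.escape(term)}\b", form):
                # Fraction of the phrase's words covered by the matched term.
                score = len(term.split()) / max(n_words, 1)
                if score > best.get(cui, (0.0, ""))[0]:
                    best[cui] = (score, name)
    # Mapping construction: highest-scoring candidate first.
    return sorted(((s, c, n) for c, (s, n) in best.items()), reverse=True)


if __name__ == "__main__":
    print(candidates("recent heart attacks"))
```

The real evaluation step combines centrality, variation, coverage and cohesiveness; the sketch keeps only a coverage-like measure so that the ranking logic stays visible.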

MetaMap 2009 is distributed free of charge under the UMLS licence. Recent developments include:

  • Word sense disambiguation improved, with phrase context
  • Integrated negation detection (based on Chapman et al.’s NegEx algorithm, 2001)
  • Temporality — identification of temporal expressions
  • Java API (MetaMap Java API, MMJA) — modern alternative to shell-pipe invocation
  • Performance — the 2009 version significantly improved speed over previous ones
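As an illustration of what negation detection involves, here is a heavily simplified sketch in the spirit of NegEx. It only looks for a handful of trigger words in a fixed window before the concept; the trigger list and the five-token window are assumptions made for this example, and the published algorithm also handles post-concept triggers, pseudo-negations and scope-terminating terms.

```python
import re

# Tiny illustrative subset of pre-concept negation triggers (an assumption).
NEGATION_TRIGGERS = ("no", "not", "denies", "without")
WINDOW = 5  # number of tokens before the concept that are inspected


def is_negated(sentence, concept):
    """Return True if a trigger appears within WINDOW tokens before the concept."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    target = concept.lower().split()
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            scope = tokens[max(0, i - WINDOW):i]
            if any(token in NEGATION_TRIGGERS for token in scope):
                return True
    return False


if __name__ == "__main__":
    print(is_negated("Patient denies chest pain or dyspnea.", "chest pain"))
    print(is_negated("Patient reports chest pain.", "chest pain"))
```

Even this crude version shows why negation matters downstream: a cohort query that ignores it would count "denies chest pain" as a positive finding.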

The historical implementation is in SICStus Prolog, with progressive Java rewrites of the most used components. The distribution model is compiled binaries for Linux, Mac OS X, Windows, plus source code accessible under UMLS licence.

SemRep and other NLM tools

MetaMap does not stand alone. NLM distributes a suite of biomedical NLP tools, and other institutions maintain related systems:

  • SemRep (NLM): extraction of semantic predications from text (“Aspirin TREATS Pain”), using MetaMap as the underlying annotation component. Builds a navigable Semantic Knowledge Base, useful for conceptual search
  • SemMedDB (NLM): public database of SemRep predications extracted from MEDLINE, used in knowledge discovery projects
  • Essie (NLM): full-text search engine integrated with UMLS concepts
  • cTAKES: clinical NLP system developed at Mayo Clinic, based on Apache UIMA. Currently internal to Mayo, with an open source release expected in the coming months
  • MedEx: medication NLP system, developed at Vanderbilt
  • MedLEE: clinical NLP system by Carol Friedman at Columbia, pioneering but largely proprietary

Use cases

Biomedical literature indexing

Literature indexing was MetaMap’s original use: MEDLINE articles are analysed to extract relevant concepts, add MeSH terms automatically, and improve PubMed search recall. NLM uses MetaMap internally to update the indexing of millions of articles.

Clinical document coding

Discharge documents, visit notes and radiology reports are analysed to extract diagnoses, procedures and drugs, which are then matched to ICD-9/10, SNOMED and ATC codes. This supports billing (for example coding of the SDO, the Italian hospital discharge record) and administrative functions.

Clinical trial eligibility

Automatic matching between trial inclusion/exclusion criteria (textual) and patient history/status (record). Identification of eligible patients with a computable preliminary screening.
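Once both criteria and patient notes have been reduced to concept identifiers, the preliminary screening becomes a set comparison. The sketch below assumes annotations are already available as CUIs with a negation flag; the Mention type and is_candidate function are hypothetical names introduced here, and the CUIs are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One extracted concept mention (hypothetical structure for this example)."""
    cui: str
    negated: bool = False


def is_candidate(mentions, include, exclude):
    """True if all inclusion CUIs are asserted and no exclusion CUI is."""
    asserted = {m.cui for m in mentions if not m.negated}
    return include <= asserted and not (exclude & asserted)


if __name__ == "__main__":
    # Illustrative CUIs: diabetes mellitus, hypertensive disease, myocardial infarction.
    DIABETES, HYPERTENSION, MI = "C0011849", "C0020538", "C0027051"
    patient = [Mention(DIABETES), Mention(HYPERTENSION), Mention(MI, negated=True)]
    print(is_candidate(patient, include={DIABETES}, exclude={MI}))
```

Such a filter only proposes candidates; temporal constraints, numeric thresholds and clinician review are outside what concept matching alone can settle.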

Adverse drug event detection

Extraction of adverse event mentions from clinical notes, with linking to suspect drug; use in active pharmacovigilance and pharmacoepidemiology.

Algorithmic phenotyping

Definition of patient cohorts with a given disease from unstructured documentation; contribution to observational and pharmacogenomic studies.

Text mining for discovery

Extraction of entity associations (drug-disease, gene-disease, protein-protein) from scientific literature; basis for literature-based discovery (Swanson discovery) projects.
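The core idea of literature-based discovery can be sketched over subject-predicate-object triples: if A is linked to B and B to C, but no direct A-C statement exists, then A-C is a candidate hypothesis. The triples below are hand-written for illustration, echoing Swanson's well-known fish oil and Raynaud example, and are not actual SemRep output.

```python
from collections import defaultdict


def abc_hypotheses(predications):
    """Return (A, B, C) chains where A-B and B-C are stated but A-C is not."""
    outgoing = defaultdict(set)
    for subject, _predicate, obj in predications:
        outgoing[subject].add(obj)
    chains = []
    for a, linked in outgoing.items():
        for b in linked:
            # .get avoids adding keys to the dict while it is being iterated.
            for c in outgoing.get(b, ()):
                if c != a and c not in linked:
                    chains.append((a, b, c))
    return sorted(chains)


if __name__ == "__main__":
    # Hand-written illustrative triples, not extracted from MEDLINE.
    triples = [
        ("Fish oil", "AFFECTS", "Blood viscosity"),
        ("Blood viscosity", "ASSOCIATED_WITH", "Raynaud disease"),
    ]
    for chain in abc_hypotheses(triples):
        print(" -> ".join(chain))
```

In practice the number of such chains explodes quickly, so real projects typically rank or filter them, for instance by semantic type or by how often each link is attested.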

Limits and trade-offs

MetaMap is not a neutral or perfect solution:

  • Ambiguity — many terms have multiple CUIs (e.g. “cold” can be temperature, common cold, or COLD = Chronic Obstructive Lung Disease); disambiguation requires context that MetaMap handles imperfectly
  • UMLS coverage — UMLS contains very many terminologies but not all sub-areas are equally represented; concept extraction is more accurate in areas with well-developed SNOMED CT than in specialist niches
  • Language — UMLS Metathesaurus contains terms in multiple languages (English, Spanish, French, German, Italian in part), but MetaMap is natively English-oriented. For Italian texts, the typical workflow is: automatic translation → MetaMap → mapping to UMLS concepts; or use of Italian UMLS sub-vocabularies where available (especially Italian MeSH)
  • Volume performance — the Prolog-based pipeline is slow for large text volumes; batch use over millions of documents requires parallelisation and pre-filtering
  • UMLS access — the required licence is a non-trivial obstacle for non-academic projects; commercial scenarios imply compliance checks and sometimes separate agreements with source vocabulary owners
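The parallelisation and pre-filtering mentioned above might look like the following sketch. The annotate function is a placeholder standing in for whatever actually invokes MetaMap (it is hypothetical and returns no concepts), and the word-count filter and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(text):
    """Placeholder for a real MetaMap invocation (hypothetical, returns no concepts)."""
    return {"text": text, "concepts": []}


def worth_annotating(text, min_words=3):
    """Pre-filter: skip fragments too short to carry clinical content."""
    return len(text.split()) >= min_words


def annotate_batch(documents, workers=4):
    """Annotate the documents that pass the pre-filter, several at a time."""
    selected = [doc for doc in documents if worth_annotating(doc)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the selected documents.
        return list(pool.map(annotate, selected))


if __name__ == "__main__":
    notes = ["ok", "chest pain on exertion", "no evidence of pneumonia"]
    for result in annotate_batch(notes):
        print(result["text"])
```

Threads are a reasonable fit when each call mostly waits on an external process or server; for CPU-bound work inside Python itself, a process pool would be the more usual choice.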

The Italian landscape

Applying MetaMap to Italian clinical text is limited by the language barrier: MetaMap is built for English, and the SPECIALIST lexicon is English. Some Italian academic research projects (Universities of Turin, Bari, Milan, Rome) have experimented with Italian-English pipelines using automatic translation, with useful but not production-grade results.

An alternative response, more sustainable for languages other than English, has been evolution towards multilingual NLP pipelines based on local resources: Italian lexicons, Italian MeSH thesauri, named entity recognition models trained on Italian clinical corpora. Projects like I-CAB (Italian Content Annotation Bank) and EVALITA with clinical tasks are emerging references.

The expected release of cTAKES as open source, built on Apache UIMA, could open the field to more flexible linguistic customisations than MetaMap allows.

Outlook

MetaMap remains, as of 2009, the reference tool for biomedical concept mapping in English. Likely directions of development:

  • Better integration with real clinical tools — electronic records, coding systems, CDS engines
  • Deeper semantics — not only concept identification but relation structuring (already SemRep’s goal)
  • Accuracy with supervised learning — coupling machine-learning models with rule-based matching
  • Cloud deployment — access to MetaMap as a remote service without distributing the full UMLS Metathesaurus to each user

For those working on biomedical clinical text, MetaMap and UMLS are today — and will likely remain for years — the reference infrastructure. Specialised alternatives (cTAKES, MedLEE, drug- or oncology-specific tools) complement rather than replace the NLM core.


References: MetaMap 2009, Alan R. Aronson, National Library of Medicine (metamap.nlm.nih.gov). UMLS 2009AA, NLM. UMLS Metathesaurus License Agreement. NegEx (Chapman et al., 2001). SPECIALIST Lexicon. SemRep, SemMedDB. MEDLINE / PubMed.
