cTAKES: open source clinical NLP from Mayo Clinic on Apache UIMA

An overview of Mayo Clinic’s 2010 release of cTAKES: its architecture on Apache UIMA, its annotation pipeline (tokenisation, POS tagging, chunking, UMLS dictionary lookup, NER, NegEx), a comparison with MetaMap, and its role in the i2b2 Obesity Challenge.


A production-quality clinical NLP infrastructure

As of 2010, the biomedical Natural Language Processing tool landscape includes NLM’s MetaMap for concept extraction, Columbia’s MedLEE (not open source), and numerous unmaintained academic projects. What is missing is a production-grade open source tool designed specifically for clinical text, rather than biomedical text in general, and offering a complete modern NLP pipeline: tokenisation, morphosyntactic analysis, biomedical NER, assertion status (positive/negated/hypothetical) and UMLS integration.

Mayo Clinic has answered this need. Since 2006 it has been developing an in-house NLP system to analyse millions of clinical notes from its enterprise information system. The system, called cTAKES (clinical Text Analysis and Knowledge Extraction System), was described in JAMIA (Journal of the American Medical Informatics Association) in 2010 by Guergana Savova’s group (Savova et al. 2010) and released as open source alongside the publication.

Apache UIMA as base

The pivotal architectural choice in cTAKES is to build the system on top of Apache UIMA (Unstructured Information Management Architecture). UIMA originated at IBM Research, entered the Apache Incubator in 2006 and graduated to an Apache Top-Level Project in 2010. It is a Java framework for building modular NLP pipelines, with:

  • Common data model CAS (Common Analysis Structure), where each annotator adds its annotations
  • Type System definable via XML to declare annotation classes (Token, Sentence, NamedEntity, UmlsConcept, DiseaseDisorderMention, MedicationMention, …)
  • Architecture of Annotators chained into Aggregate Analysis Engines
  • Collection handling via CollectionReader and CAS Consumer for I/O
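As a rough illustration of the ideas in the list above (a shared CAS that chained annotators enrich one after another), here is a minimal Python sketch. It is not the UIMA API, which is Java; the names Cas, Annotation, sentence_annotator, token_annotator and run_aggregate are hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    type: str   # annotation class, e.g. "Sentence" or "Token"
    begin: int  # character offset where the span starts
    end: int    # character offset just past the span


@dataclass
class Cas:
    """Toy stand-in for a CAS: document text plus stand-off annotations."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def select(self, type_name: str) -> List[Annotation]:
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, a: Annotation) -> str:
        return self.text[a.begin:a.end]


Annotator = Callable[[Cas], None]


def sentence_annotator(cas: Cas) -> None:
    # Naive rule for the demo: every full stop ends a sentence.
    start = 0
    for i, ch in enumerate(cas.text):
        if ch == ".":
            while start < i and cas.text[start].isspace():
                start += 1  # skip whitespace between sentences
            cas.annotations.append(Annotation("Sentence", start, i + 1))
            start = i + 1


def token_annotator(cas: Cas) -> None:
    # Builds on the Sentence layer added by the previous annotator.
    for s in cas.select("Sentence"):
        for m in re.finditer(r"\w+|[^\w\s]", cas.text[s.begin:s.end]):
            cas.annotations.append(
                Annotation("Token", s.begin + m.start(), s.begin + m.end())
            )


def run_aggregate(annotators: List[Annotator], cas: Cas) -> Cas:
    # An "aggregate engine" here is simply an ordered list of annotators.
    for annotate in annotators:
        annotate(cas)
    return cas


cas = run_aggregate(
    [sentence_annotator, token_annotator],
    Cas("Patient denies fever. Reports cough."),
)
sentences = [cas.covered_text(a) for a in cas.select("Sentence")]
tokens = [cas.covered_text(a) for a in cas.select("Token")]
```

The point of the design is that annotators never talk to each other directly: each one reads the layers already present in the shared structure and appends its own, so components can be swapped or reordered independently.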

UIMA is the same framework that IBM Watson uses in its pipelines, and it is widely used in industry for information extraction systems. cTAKES inherits UIMA’s robustness, modularity and scalability, all essential for processing real clinical volumes.

cTAKES’s pipeline

cTAKES exposes a sequence of UIMA annotators, each responsible for adding a layer of information to the CAS:

Sentence Detector

Based on OpenNLP, this component segments the document into sentences. It is tuned for clinical text: common abbreviations (Dr., mg, q.d.) are handled without being treated as sentence endings.
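To make the abbreviation problem concrete, the following Python sketch uses a simple rule-based splitter. This is a stand-in chosen for illustration, not the trained OpenNLP model that cTAKES relies on; the abbreviation list and the function name split_sentences are assumptions.

```python
import re
from typing import List

# Illustrative abbreviation list (an assumption for this sketch);
# a trained model learns such cases from annotated data instead.
ABBREVIATIONS = {"dr.", "mg.", "q.d.", "b.i.d.", "pt."}

# Candidate boundary: ., ! or ? followed by whitespace and a capital
# letter, or by the end of the text.
BOUNDARY = re.compile(r"[.!?](?=\s+[A-Z]|\s*$)")


def split_sentences(text: str) -> List[str]:
    sentences = []
    start = 0
    for m in BOUNDARY.finditer(text):
        end = m.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr." and friends do not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


note = ("Seen by Dr. Smith today. Started metformin 500 mg q.d. "
        "for diabetes. No complaints.")
result = split_sentences(note)
```

A rule list like this fails when an abbreviation really does end a sentence, which is one reason a statistical sentence detector is preferred in practice.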

Tokenizer

Splits sentences into tokens with normalisation (numbers, punctuation, symbols). Adapted to clinical text, which has different patterns from general language (dosages “500 mg BID”, values “120/80 mmHg”).
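A minimal Python sketch of this kind of tokenisation is shown below. The pattern and the token categories (RATIO, NUMBER, WORD, SYMBOL) are assumptions made for the example and are not taken from the cTAKES tokenizer.

```python
import re
from typing import List, Tuple

TOKEN_PATTERN = re.compile(
    r"""
      (?P<RATIO>\d+/\d+)          # values such as blood pressure 120/80
    | (?P<NUMBER>\d+(?:\.\d+)?)   # integers and decimals such as 500 or 2.5
    | (?P<WORD>[A-Za-z]+)         # alphabetic words and unit names
    | (?P<SYMBOL>[^\sA-Za-z\d])   # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Return (token text, category) pairs in document order."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("BP 120/80 mmHg, metformin 500 mg BID.")
```

Keeping "120/80" as a single token, rather than three, is the sort of clinical adaptation the paragraph above refers to.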

Part-of-Speech Tagger

Labels each token with a POS tag (noun, verb, adjective, …). The model is trained on annotated clinical corpora.

Chunker (Shallow Parser)

Identifies noun phrases, verb phrases and prepositional phrases as first-level syntactic units. This is not full parsing: chunking is faster and more robust on clinical text, which is often telegraphic.
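The idea of shallow parsing can be sketched in Python with a hand-written pattern over POS tags. This is a simplification for illustration only (a real chunker is typically a trained model); the pattern "optional determiner, any adjectives, one or more nouns" and the function name noun_phrases are assumptions, and the tags follow the Penn Treebank convention.

```python
from typing import List, Tuple


def noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Group runs matching DT? JJ* NN+ into noun phrases."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                         # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):   # adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):   # head nouns
            k += 1
        if k > j:                                        # at least one noun found
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases


tagged_sentence = [
    ("The", "DT"), ("patient", "NN"), ("reports", "VBZ"),
    ("severe", "JJ"), ("chest", "NN"), ("pain", "NN"),
]
chunks = noun_phrases(tagged_sentence)
```

No tree is built: the chunker only marks flat, non-overlapping spans, which is why it degrades gracefully on fragmentary sentences.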

Dictionary Lookup (UMLS Concept Mapping)

The core of the system: matching of chunks against UMLS Metathesaurus entries, restricted to clinically relevant Semantic Types (Disorder, Sign or Symptom, Pharmacologic Substance, Procedure, Anatomical Site). Returns CUI, Semantic Type, position. Output is functionally analogous to MetaMap’s but with a different implementation approach.
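A toy version of dictionary lookup is sketched below in Python. The mini-dictionary, the placeholder identifiers (which are deliberately not real UMLS CUIs) and the greedy longest-match strategy are all assumptions for the example; the actual cTAKES lookup algorithm is described in the JAMIA paper and differs in detail.

```python
from typing import Dict, List, NamedTuple, Tuple


class Concept(NamedTuple):
    cui: str            # placeholder identifier, NOT a real UMLS CUI
    semantic_type: str


class Match(NamedTuple):
    begin: int          # index of the first matched token
    end: int            # index just past the last matched token
    concept: Concept


# Hypothetical mini-dictionary keyed by lower-cased token sequences.
DICTIONARY: Dict[Tuple[str, ...], Concept] = {
    ("chest", "pain"): Concept("CUI_DEMO_1", "Sign or Symptom"),
    ("pain",): Concept("CUI_DEMO_2", "Sign or Symptom"),
    ("aspirin",): Concept("CUI_DEMO_3", "Pharmacologic Substance"),
}
MAX_TERM_LENGTH = max(len(term) for term in DICTIONARY)


def lookup(tokens: List[str]) -> List[Match]:
    """Greedy longest-match lookup over a token sequence."""
    normalized = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(normalized):
        longest = min(MAX_TERM_LENGTH, len(normalized) - i)
        for n in range(longest, 0, -1):       # try the longest window first
            concept = DICTIONARY.get(tuple(normalized[i:i + n]))
            if concept is not None:
                matches.append(Match(i, i + n, concept))
                i += n
                break
        else:
            i += 1                            # nothing starts at this token
    return matches


found = lookup(["Chest", "pain", "relieved", "by", "aspirin"])
```

Preferring the longest match means "chest pain" is reported as one concept instead of the less specific "pain".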

Named Entity Recognition

Recognition of DiseaseDisorderMention, SignSymptomMention, ProcedureMention, MedicationMention, AnatomicalSiteMention as structured entities. Each entity carries attributes: semantic type, CUI, severity, subject (patient, family member), negation.

Assertion Module

Determines whether a mention is:

  • Positive (present)
  • Negated (not present) — based on Chapman’s NegEx
  • Hypothetical (considered, possible)
  • Historical (past, resolved)
  • Family (attributed to a family member, not the patient)
  • Experiential (subject is the patient, a family member, or someone else)
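The negation case in the list above can be illustrated with a heavily simplified, NegEx-style rule in Python: a trigger word before the mention negates it unless a terminating word intervenes. The trigger and terminator sets, the window size and the function name is_negated are assumptions for this sketch and cover only a small fraction of what NegEx handles.

```python
from typing import List

# Small illustrative sets (assumptions); NegEx uses much longer phrase lists.
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
TERMINATORS = {"but", "however", "except"}
WINDOW = 5  # how many tokens before the mention are inspected


def is_negated(tokens: List[str], mention_start: int) -> bool:
    """True if a trigger precedes the mention within WINDOW tokens
    and no terminator lies between the trigger and the mention."""
    lowered = [t.lower() for t in tokens]
    lowest = max(0, mention_start - WINDOW)
    # Walk backwards from the token just before the mention.
    for i in range(mention_start - 1, lowest - 1, -1):
        if lowered[i] in TERMINATORS:
            return False  # negation scope was closed before the mention
        if lowered[i] in NEGATION_TRIGGERS:
            return True
    return False


sentence = "Patient denies chest pain but reports cough".split()
chest_pain_negated = is_negated(sentence, 2)  # mention starts at "chest"
cough_negated = is_negated(sentence, 6)       # mention starts at "cough"
```

The terminator check is what keeps "cough" positive in the example even though "denies" falls inside its window.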

Dependency Parser

Fine-grained dependency parsing, available as an optional module (still evolving at the time of writing).

Ground truth and evaluation

The JAMIA publication documents cTAKES performance on Mayo Clinic enterprise clinical datasets, with:

  • F-measure for Named Entity-level concept mapping: 0.715 (exact match) on a gold standard of about 160 Mayo documents
  • Component-wise metrics: POS tagger, chunker, assertion module
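For readers unfamiliar with the metric, the following Python sketch shows how an exact-match F-measure is computed from gold and predicted annotations. The spans and identifiers in the example are invented for illustration and have no connection to the Mayo gold standard or to the figures reported in the paper.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (begin offset, end offset, concept identifier)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scoring: a prediction counts only if offsets and
    concept identifier are identical to a gold annotation."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example data: 4 gold spans, 3 predictions, 2 exact matches.
gold = {(0, 10, "A"), (15, 22, "B"), (30, 37, "C"), (40, 48, "D")}
predicted = {(0, 10, "A"), (15, 22, "B"), (30, 36, "C")}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Here the third prediction is off by one character and therefore counts as an error under exact matching, which is why exact-match scores are stricter than overlap-based ones.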

Results place the system at a level competitive with MetaMap and MedLEE for the covered domains. cTAKES is not uniformly better — it is better for clinical text in the literal sense (record notes, discharge letters), where it was trained and tuned; MetaMap remains superior for biomedical literature (PubMed/MEDLINE articles), where it was originally developed.

i2b2 and development communities

One of the ecosystems adopting cTAKES is i2b2 (Informatics for Integrating Biology and the Bedside), an NIH-funded consortium based at Partners HealthCare / Harvard. i2b2 organises annual Natural Language Processing Challenges with de-identified clinical datasets. Many participants in the i2b2 Obesity Challenge 2008 and the i2b2/VA 2010 Concepts, Assertions and Relations Challenge have used or built on cTAKES, and the 2010 challenge is in turn influencing the direction of cTAKES development.

Research on clinical note de-identification — removal of personal information to enable sharing — runs in parallel and converges with cTAKES: tools like MITRE’s de-ID work on the same texts with complementary goals.

Use and distribution

In 2010 cTAKES is distributed by Mayo Clinic as open source, with the stated intention of moving it to the Apache Software Foundation in the coming years; a formal Apache Incubator transition is expected once the public codebase has stabilised. The current licence is already oriented to Apache 2.0.

Typical users are:

  • US academic medical centres — Mayo itself, Partners HealthCare, Vanderbilt, Cincinnati Children’s, University of Pittsburgh, Columbia — adopting cTAKES for retrospective analyses over large note volumes
  • Biotech and pharma companies for cohort identification in observational studies and active pharmacovigilance
  • Research groups in medical informatics and translational bioinformatics
  • EHR vendors integrating cTAKES as text analysis component (Epic, Cerner exploring integrations)

Production use in real-time clinical systems (for example, a record system that calls cTAKES when a note is saved in order to suggest ICD codes) is still limited; most usage is batch analytics on documents that have already been written.

Comparison with MetaMap

Aspect                | MetaMap                    | cTAKES
----------------------|----------------------------|-------------------------------------------------------------
Developer             | NLM                        | Mayo Clinic
Primary domain        | Biomedical literature      | Clinical text
Architecture          | Prolog + Java API          | Apache UIMA, Java
Pipeline              | Monolithic                 | Modular (UIMA Annotators)
Extensibility         | Limited                    | High (via UIMA)
Performance           | Suboptimal on large volumes | Optimised for batch
Clinical-specific NER | Via Semantic Type filter   | Explicitly modelled types
Assertion             | Basic                      | Complete (positive, negated, hypothetical, historical, family)
Licence               | UMLS licence               | Apache-like open source
Community             | NLM-centric                | Growing

The two tools often coexist in the same project: cTAKES for enterprise clinical text processing, MetaMap for reference biomedical literature, and UMLS as the common terminological backbone.

Limits

cTAKES 2010 has recognised limits:

  • Language — trained and optimised for English. Adaptation to other languages (Italian, German, Spanish, French) requires component-by-component model replacement and ideally local annotated-corpus training
  • Training corpora — annotators are tuned on Mayo clinical notes (telegraphic, US style); performance degrades on very different texts (paediatric, highly specialist, international)
  • De-identification — cTAKES does not de-identify text; must be paired with dedicated tools (MITRE de-ID, Physionet deid, custom tools)
  • Complex context handling — multi-sentence reasoning (what was said in one sentence, summarised in another) remains an open challenge
  • Temporal reasoning — distinguishing past events from current, future from planned, is partially supported; future versions will give the topic more attention

cTAKES in Italy

As of 2010, applying cTAKES to Italian clinical text is experimental, mostly academic. Barriers are similar to MetaMap’s: absence of a specific clinical Italian tokenizer/POS tagger, lack of large Italian annotated corpora, components awaiting translation/adaptation.

Emerging strategies:

  • Multilingual pipelines that translate Italian into English automatically and apply cTAKES to the English text, with CUI mapping back to the original text
  • UIMA-native adaptation — building parallel UIMA pipelines with Italian components (Italian Tokenizer, POS tagger trained on Italian corpora, dictionary lookup on Italian MeSH, Italian SNOMED CT where available)
  • Hybrid approaches — entity-family-specific Italian NER (ATC drugs, Italian ICD-9-CM diagnoses) alongside cTAKES for the international conceptual core

This theme, production-quality multilingual clinical NLP, is expected to be the subject of active research in the coming years, with Italian, German, French and Scandinavian groups involved in EU FP7 projects and subsequent Horizon programmes.

Outlook

The cTAKES roadmap in coming years envisages:

  • Entry into the Apache Software Foundation as a Top-Level project — consolidating governance and sustainability
  • Integration with UIMA-AS — asynchronous/distributed version of UIMA for processing larger volumes
  • New annotators — coreference, relation extraction, temporal event extraction
  • Deep learning — the wave of neural models will start changing NLP; cTAKES will need to couple its rule-based and traditional ML (CRF, SVM) components with deep approaches
  • EHR integration — record systems integrating cTAKES as a component with realtime-consumable output for CDS and operational research

The cTAKES model — open source, production-grade, documented in peer-reviewed literature, built on an industrial standard (UIMA) — represents a qualitative jump over the previous generation of biomedical NLP tools, and marks the entry of clinical NLP into a phase of applicative maturity.


References: Guergana K. Savova et al., “Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications”, JAMIA 17(5):507-513 (2010). Apache UIMA (uima.apache.org). OpenNLP. UMLS Metathesaurus (NLM). i2b2 — Informatics for Integrating Biology and the Bedside.
