Contents

A production-quality clinical NLP infrastructure
Apache UIMA as base
cTAKES’s pipeline
Ground truth and evaluation
i2b2 and development communities
Use and distribution
Comparison with MetaMap
Limits
cTAKES in Italy
Outlook


A production-quality clinical NLP infrastructure

As of 2010, the biomedical Natural Language Processing tool landscape includes NLM’s MetaMap for concept extraction, Columbia’s MedLEE (not Open Source) and numerous unmaintained academic projects. What is missing is a production-grade Open Source tool designed specifically for clinical text, rather than biomedical text in general, with the completeness of a modern NLP pipeline: tokenisation, morphosyntactic analysis, biomedical NER, assertion status (positive/negated/hypothetical) and UMLS integration.

Mayo Clinic has answered this need: since 2006 it has been developing in-house an NLP system to analyse millions of clinical notes from its enterprise information system. The system, called cTAKES (clinical Text Analysis and Knowledge Extraction System), was released as Open Source in March 2009 through the Open Health Natural Language Processing (OHNLP) Consortium by Guergana Savova and Christopher Chute’s group at Mayo Clinic, and was described in the Journal of the American Medical Informatics Association (JAMIA) in 2010 (Savova et al. 2010).

Apache UIMA as base

The pivotal architectural choice in cTAKES is to build the system on top of Apache UIMA (Unstructured Information Management Architecture). UIMA originated at IBM Research, entered the Apache Incubator in 2006 and graduated to an Apache Top-Level Project in 2010. It is a Java framework for building modular NLP pipelines, with:

Common data model CAS (Common Analysis Structure), where each annotator adds its annotations
Type System definable via XML to declare annotation classes (Token, Sentence, NamedEntity, UmlsConcept, DiseaseDisorderMention, MedicationMention, …)
Architecture of Annotators chained into Aggregate Analysis Engines
Collection handling via CollectionReader and CAS Consumer for I/O
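The division of labour between the CAS and the annotators can be illustrated with a toy model. The following is a minimal Python sketch, not the UIMA Java API: the names `Cas`, `Annotation`, `sentence_annotator`, `token_annotator` and `run_aggregate` are hypothetical stand-ins chosen for illustration, and the segmentation logic is deliberately naive.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    type: str   # annotation class name, e.g. "Sentence" or "Token"
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass
class Cas:
    """Toy stand-in for a CAS: the document text plus every annotation added so far."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def select(self, type_name: str) -> List[Annotation]:
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.begin:annotation.end]


Annotator = Callable[[Cas], None]


def sentence_annotator(cas: Cas) -> None:
    # Naive assumption for the sketch: one sentence per line.
    start = 0
    for line in cas.text.split("\n"):
        if line.strip():
            cas.annotations.append(Annotation("Sentence", start, start + len(line)))
        start += len(line) + 1  # +1 for the newline removed by split


def token_annotator(cas: Cas) -> None:
    # Reads the Sentence annotations written by the previous annotator.
    for sentence in cas.select("Sentence"):
        position = sentence.begin
        for word in cas.covered_text(sentence).split():
            begin = cas.text.index(word, position)
            cas.annotations.append(Annotation("Token", begin, begin + len(word)))
            position = begin + len(word)


def run_aggregate(cas: Cas, annotators: List[Annotator]) -> Cas:
    # An aggregate engine is, in essence, an ordered chain of annotators over one CAS.
    for annotate in annotators:
        annotate(cas)
    return cas


cas = run_aggregate(Cas("No chest pain.\nDenies fever."),
                    [sentence_annotator, token_annotator])
tokens = [cas.covered_text(t) for t in cas.select("Token")]
print(tokens)
```

The point of the design is that annotators never call each other: each one only reads from and writes to the shared structure, so a component can be replaced without touching the rest of the chain.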

UIMA is the same framework that IBM uses in its Watson pipelines, and is a widely adopted industrial framework for information extraction systems. cTAKES inherits UIMA’s robustness, modularity and scalability, all essential for processing real clinical volumes.

cTAKES’s pipeline

cTAKES exposes a sequence of UIMA annotators, each responsible for adding a layer of information to the CAS:

Sentence Detector

Based on OpenNLP, it segments the document into sentences. The model is adapted to clinical text: it handles common abbreviations (Dr., mg, q.d.) without treating them as sentence endings.
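The abbreviation problem can be shown with a small rule-based sketch. This is an illustration only: the cTAKES component relies on a trained OpenNLP model, whereas the code below uses a hand-written and hypothetical `ABBREVIATIONS` list, and the function name `split_sentences` is likewise an assumption.

```python
import re
from typing import List

# Hypothetical abbreviation list for illustration; a trained model learns this from data.
ABBREVIATIONS = {"dr.", "mg.", "q.d.", "b.i.d.", "pt."}


def split_sentences(text: str) -> List[str]:
    sentences: List[str] = []
    start = 0
    # Candidate boundary: sentence-final punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text):
        candidate_end = match.start() + 1
        last_word = text[start:candidate_end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not to a sentence end
        sentences.append(text[start:candidate_end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


note = "Seen by Dr. Smith today. Aspirin 81 mg q.d. was continued. No new complaints."
for sentence in split_sentences(note):
    print(sentence)
```

A rule of this kind fails when an abbreviation really does close a sentence, which is one reason a statistical model trained on clinical notes is preferable.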

Tokenizer

Splits sentences into tokens with normalisation (numbers, punctuation, symbols). Adapted to clinical text, which has different patterns from general language (dosages “500 mg BID”, values “120/80 mmHg”).
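As a rough illustration of why clinical patterns need dedicated handling, the following Python sketch keeps a ratio such as 120/80 together while separating a dose from its unit. The regular expression and the name `tokenize` are assumptions made for this example and do not reproduce the cTAKES tokenizer rules.

```python
import re
from typing import List

TOKEN_PATTERN = re.compile(
    r"""
    \d+/\d+          # ratio-style values such as a blood pressure of 120/80
  | \d+(?:\.\d+)?    # integers and decimals such as 500 or 2.5
  | [A-Za-z]+        # alphabetic words and units (mg, BID, mmHg)
  | [^\w\s]          # any single punctuation mark or symbol
    """,
    re.VERBOSE,
)


def tokenize(sentence: str) -> List[str]:
    return TOKEN_PATTERN.findall(sentence)


print(tokenize("Metformin 500 mg BID, BP 120/80 mmHg."))
```

Alternatives are tried left to right, so the ratio pattern must precede the plain number pattern; otherwise 120/80 would be split into three tokens.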

Part-of-Speech Tagger

Labels each token with a part-of-speech tag (noun, verb, adjective, …). The model is trained on annotated clinical corpora.

Chunker (Shallow Parser)

Identifies noun phrases, verb phrases and prepositional phrases as first-level syntactic units. This is not full parsing: chunking is faster and more robust on clinical text, which is often telegraphic.
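The idea of chunking can be sketched with a simple rule over POS tags. This is a simplification for illustration: the function `noun_phrase_chunks`, the tag sets and the hand-tagged example are assumptions, whereas a real chunker is typically a trained sequence model.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, Penn Treebank style POS tag)

NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}  # tags allowed inside a noun phrase
NOUN_TAGS = {"NN", "NNS", "NNP"}            # a chunk must contain at least one of these


def noun_phrase_chunks(tagged: List[TaggedToken]) -> List[str]:
    """Return maximal runs of determiner/adjective/noun tokens that contain a noun."""
    chunks: List[str] = []
    current: List[TaggedToken] = []
    # The sentinel forces the last open run to be flushed.
    for word, tag in tagged + [("", "END")]:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        if any(t in NOUN_TAGS for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


tagged_sentence = [
    ("Patient", "NN"), ("denies", "VBZ"), ("severe", "JJ"), ("chest", "NN"),
    ("pain", "NN"), ("on", "IN"), ("exertion", "NN"),
]
print(noun_phrase_chunks(tagged_sentence))
```

No tree is built: the sentence is only cut into flat segments, which is why the approach tolerates fragments and missing verbs.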

Dictionary Lookup (UMLS Concept Mapping)

The core of the system: matching of chunks against UMLS Metathesaurus entries, restricted to clinically relevant Semantic Types (Disorder, Sign or Symptom, Pharmacologic Substance, Procedure, Anatomical Site). Returns CUI, Semantic Type, position. Output is functionally analogous to MetaMap’s but with a different implementation approach.
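A dictionary lookup of this kind can be sketched as a longest-match search over token windows. The sketch below is an assumption-laden simplification, not the cTAKES lookup algorithm: the dictionary is three hand-written entries, the CUI values are placeholders rather than real UMLS identifiers, and the names `Concept`, `DICTIONARY` and `lookup` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Concept(NamedTuple):
    cui: str
    semantic_type: str


# Placeholder entries: the CUI strings are NOT real UMLS identifiers.
DICTIONARY: Dict[Tuple[str, ...], Concept] = {
    ("chest", "pain"): Concept("CUI_PLACEHOLDER_1", "Sign or Symptom"),
    ("pain",): Concept("CUI_PLACEHOLDER_2", "Sign or Symptom"),
    ("aspirin",): Concept("CUI_PLACEHOLDER_3", "Pharmacologic Substance"),
}
MAX_SPAN = max(len(key) for key in DICTIONARY)


def lookup(tokens: List[str]) -> List[Tuple[int, int, Concept]]:
    """Return (begin_token, end_token, concept) triples, preferring the longest match."""
    hits: List[Tuple[int, int, Concept]] = []
    i = 0
    while i < len(tokens):
        for size in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + size])
            if key in DICTIONARY:
                hits.append((i, i + size, DICTIONARY[key]))
                i += size  # skip past the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return hits


hits = lookup(["Chest", "pain", "relieved", "by", "aspirin"])
for begin, end, concept in hits:
    print(begin, end, concept.cui, concept.semantic_type)
```

Preferring the longest match means "chest pain" is returned as one concept instead of the more generic "pain", which is usually the clinically useful reading.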

Named Entity Recognition

Recognition of DiseaseDisorderMention, SignSymptomMention, ProcedureMention, MedicationMention, AnatomicalSiteMention as structured entities. Each entity carries attributes: semantic type, CUI, severity, subject (patient, family member), negation.
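The attribute set listed above maps naturally onto a small record type. The following Python sketch is illustrative only: the `Mention` class, its field names and the placeholder CUI values are assumptions and do not reproduce the actual cTAKES type system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    mention_type: str              # e.g. "DiseaseDisorderMention"
    text: str                      # covered text in the note
    begin: int                     # character offset, inclusive
    end: int                       # character offset, exclusive
    cui: str                       # placeholder values are used below
    semantic_type: str
    negated: bool = False
    subject: str = "patient"       # "patient" or "family_member"
    severity: Optional[str] = None


def active_patient_problems(mentions: List[Mention]) -> List[Mention]:
    """Keep disorders that are asserted (not negated) and attributed to the patient."""
    return [
        m for m in mentions
        if m.mention_type == "DiseaseDisorderMention"
        and not m.negated
        and m.subject == "patient"
    ]


mentions = [
    Mention("DiseaseDisorderMention", "hypertension", 12, 24,
            "CUI_PLACEHOLDER_A", "Disorder", severity="moderate"),
    Mention("DiseaseDisorderMention", "diabetes", 40, 48,
            "CUI_PLACEHOLDER_B", "Disorder", negated=True),
    Mention("DiseaseDisorderMention", "stroke", 70, 76,
            "CUI_PLACEHOLDER_C", "Disorder", subject="family_member"),
]
print([m.text for m in active_patient_problems(mentions)])
```

Carrying negation and subject on the entity itself is what allows downstream queries, such as cohort selection, to avoid counting denied or family-history conditions.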

Assertion Module

Determines whether a mention is:

Positive (present)
Negated (not present) — based on Chapman’s NegEx
Hypothetical (considered, possible)
Historical (past, resolved)
Family (attributed to a family member, not the patient)
Experiencer (the subject of the mention: the patient, a family member, or someone else)
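The negation case can be illustrated with a heavily simplified sketch in the spirit of NegEx: a trigger phrase that precedes the mention within a short window marks it as negated, unless a terminating word intervenes. The trigger list, terminator list, window size and function name below are assumptions for illustration; the published NegEx algorithm uses much larger phrase lists and also handles triggers that follow the mention.

```python
import re
from typing import List

NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]  # illustrative subset
TERMINATORS = {"but", "however"}                                    # words that end the scope
WINDOW = 5                                                          # tokens inspected before the mention


def is_negated(sentence: str, mention: str) -> bool:
    tokens: List[str] = re.findall(r"[a-z]+", sentence.lower())
    target: List[str] = re.findall(r"[a-z]+", mention.lower())
    # Locate the first occurrence of the mention in the token sequence.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            break
    else:
        return False  # mention not found in the sentence
    scope = tokens[max(0, i - WINDOW):i]
    # A terminator cuts the scope: only words after it can negate the mention.
    for j in range(len(scope) - 1, -1, -1):
        if scope[j] in TERMINATORS:
            scope = scope[j + 1:]
            break
    scope_text = " ".join(scope)
    return any(re.search(rf"\b{re.escape(trigger)}\b", scope_text)
               for trigger in NEGATION_TRIGGERS)


print(is_negated("Patient denies chest pain.", "chest pain"))
print(is_negated("No fever, but chest pain persists.", "chest pain"))
```

The terminator rule is what keeps "no fever, but chest pain persists" from negating the chest pain.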

Dependency Parser

Fine-grained dependency parsing, available as an optional module (still evolving as of 2010).

Ground truth and evaluation

The JAMIA publication documents cTAKES performance on Mayo Clinic enterprise clinical datasets, with:

F-measure for Named Entity-level concept mapping: 0.715 (exact match) on a gold standard of about 160 Mayo documents
Component-wise metrics: POS tagger, chunker, assertion module
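For readers unfamiliar with the metric, an exact-match F-measure counts a predicted entity as correct only when its boundaries and label coincide with a gold annotation. The following Python sketch shows the computation on invented spans; the data are hypothetical and unrelated to the Mayo gold standard, and the function name `precision_recall_f1` is an assumption.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (begin offset, end offset, entity label)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    true_positives = len(gold & predicted)  # exact match on offsets and label
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: four gold entities, three predictions, one with wrong boundaries.
gold = {(0, 10, "Disorder"), (15, 20, "Symptom"), (30, 38, "Medication"), (40, 45, "Disorder")}
predicted = {(0, 10, "Disorder"), (15, 20, "Symptom"), (30, 36, "Medication")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

The prediction with slightly wrong boundaries counts as both a false positive and a false negative, which is why exact-match scores are stricter than scores that accept overlapping spans.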

Results place the system at a level competitive with MetaMap and MedLEE for the covered domains. cTAKES is not uniformly better — it is better for clinical text in the literal sense (record notes, discharge letters), where it was trained and tuned; MetaMap remains superior for biomedical literature (PubMed/MEDLINE articles), where it was originally developed.

i2b2 and development communities

One of the ecosystems adopting cTAKES is i2b2 (Informatics for Integrating Biology and the Bedside), an NIH-funded National Center for Biomedical Computing based at Partners HealthCare / Harvard. i2b2 organises regular Natural Language Processing challenges on de-identified clinical datasets. In the i2b2 Obesity Challenge 2008 and the i2b2/VA 2010 Concepts, Assertions and Relations Challenge, many participants used or built on cTAKES, and the 2010 challenge is influencing the direction of cTAKES development.

Research on clinical note de-identification (the removal of personal information to enable sharing) runs in parallel and converges with cTAKES: tools such as MITRE’s MIST (MITRE Identification Scrubber Toolkit) work on the same texts with complementary goals.

Use and distribution

In 2010, cTAKES is distributed by Mayo Clinic as Open Source, with the stated intention of moving it to the Apache Software Foundation in the coming years; a formal transition to the Apache Incubator is expected once the public codebase has stabilised. The current licence is already Apache 2.0-style.

Typical users are:

US academic medical centres — Mayo itself, Partners HealthCare, Vanderbilt, Cincinnati Children’s, University of Pittsburgh, Columbia — adopting cTAKES for retrospective analyses over large note volumes
Biotech and pharma companies for cohort identification in observational studies and active pharmacovigilance
Research groups in medical informatics and translational bioinformatics
EHR vendors integrating cTAKES as text analysis component (Epic, Cerner exploring integrations)

Production use in real-time clinical systems (for example, a record system that calls cTAKES when a note is saved in order to suggest ICD codes) is still limited; most usage is batch analytics on documents that have already been written.

Comparison with MetaMap

Aspect	MetaMap	cTAKES
Developer	NLM	Mayo Clinic
Primary domain	Biomedical literature	Clinical text
Architecture	Prolog + Java API	Apache UIMA, Java
Pipeline	Monolithic	Modular (UIMA Annotators)
Extensibility	Limited	High (via UIMA)
Performance	Suboptimal on large volumes	Optimised for batch
Clinical-specific NER	Via Semantic Type filter	Explicitly modelled types
Assertion	Basic	Complete (positive, negated, hypothetical, historical, family)
Licence	UMLS licence	Apache-like Open Source
Community	NLM-centric	Growing

The two tools often coexist in the same project: cTAKES for enterprise clinical text processing, MetaMap for reference biomedical literature, and UMLS as the common terminological backbone.

Limits

As of 2010, cTAKES has acknowledged limits:

Language: trained and optimised for English. Adaptation to other languages (Italian, German, Spanish, French) requires replacing the models component by component and, ideally, training on locally annotated corpora
Training corpora: annotators are tuned on Mayo clinical notes (telegraphic, US style); performance degrades on very different texts (paediatric, highly specialist, international)
De-identification: cTAKES does not de-identify text; it must be paired with dedicated tools (MITRE’s MIST, PhysioNet deid, custom tools)
Complex context handling: multi-sentence reasoning (something stated in one sentence and summarised in another) remains an open challenge
Temporal reasoning: distinguishing past events from current ones, and planned events from those that have actually occurred, is only partially supported; future versions are expected to give the topic more attention

cTAKES in Italy

As of 2010, applying cTAKES to Italian clinical text is experimental, mostly academic. Barriers are similar to MetaMap’s: absence of a specific clinical Italian tokenizer/POS tagger, lack of large Italian annotated corpora, components awaiting translation/adaptation.

Emerging strategies:

Multilingual pipelines that translate Italian into English automatically and apply cTAKES to the English text, with CUI mapping back to the original text
UIMA-native adaptation — building parallel UIMA pipelines with Italian components (Italian Tokenizer, POS tagger trained on Italian corpora, dictionary lookup on Italian MeSH, Italian SNOMED CT where available)
Hybrid approaches — entity-family-specific Italian NER (ATC drugs, Italian ICD-9-CM diagnoses) alongside cTAKES for the international conceptual core

This theme, production-quality multilingual clinical NLP, will be the subject of active research in the coming years, with Italian, German, French and Scandinavian groups involved in EU FP7 projects and subsequent Horizon programmes.

Outlook

The cTAKES roadmap for the coming years envisages:

Entry into the Apache Software Foundation as a Top-Level project — consolidating governance and sustainability
Integration with UIMA-AS — asynchronous/distributed version of UIMA for processing larger volumes
New annotators — coreference, relation extraction, temporal event extraction
Deep learning — the wave of neural models will start changing NLP; cTAKES will need to couple its rule-based and traditional ML (CRF, SVM) components with deep approaches
EHR integration — record systems integrating cTAKES as a component with realtime-consumable output for CDS and operational research

The cTAKES model — Open Source, production-grade, documented in peer-reviewed literature, built on an industrial standard (UIMA) — represents a qualitative jump over the previous generation of biomedical NLP tools, and marks the entry of clinical NLP into a phase of applicative maturity.

References: Guergana K. Savova et al., “Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications”, JAMIA 17(5):507-513 (2010). Apache UIMA (uima.apache.org). OpenNLP. UMLS Metathesaurus (NLM). i2b2 — Informatics for Integrating Biology and the Bedside.


cTAKES: Open Source clinical NLP from Mayo Clinic on Apache UIMA