Protégé and biomedical ontologies: editor, reasoners and the OBO Foundry

Stanford's Protégé 4.3, the biomedical ontology ecosystem (GO, ChEBI, FMA, HPO), OBO Foundry and NCBO's BioPortal, OWL 2 reasoners (HermiT, Pellet, ELK) and uses in phenotyping, drug discovery and CDS.

Ontologies as structure of biomedical knowledge

Terminologies such as SNOMED CT, LOINC and ICD are structured lists of concepts with identifiers, descriptions and hierarchies. Ontologies take a further step: they represent not only concepts but also the relations among them (causes, part of, participates in, regulates) with the formality of a description logic, so that software can reason over the represented domain: infer conclusions that are not stated explicitly, check consistency, classify instances.

The biomedical application is natural: a gene ontology says that TP53 is a tumour suppressor protein participating in the cell cycle and regulating apoptosis; a phenotypic ontology says that brachydactyly is a form of finger anomaly that is a limb anomaly. From a diagnosis we can trace is-a to all diagnostic super-classes; from a drug we can navigate the ATC classes; from a phenotype we can find associated diseases.
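Tracing is-a links up to all super-classes is a transitive closure over the hierarchy. A minimal sketch, assuming a toy hierarchy held in a dictionary (the class names come from the brachydactyly example above and are not real ontology identifiers):

```python
# Toy is-a hierarchy; names are illustrative, not real HPO identifiers.
IS_A = {
    "Brachydactyly": ["Finger anomaly"],
    "Finger anomaly": ["Limb anomaly"],
    "Limb anomaly": [],
}


def ancestors(term, is_a=IS_A):
    """Return every super-class reachable from `term` by following is-a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("Brachydactyly")))
```

The same traversal works when a class has several parents, which is common in biomedical ontologies; the `seen` set prevents a shared ancestor from being visited twice.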

The main open source ontology editor used in biomedicine is Protégé, developed at Stanford.

Protégé

Protégé was created at the Biomedical Informatics Research Group (today Stanford Center for Biomedical Informatics Research, BMIR) under the leadership of Mark A. Musen starting in the 1980s, originally as a frame-based ontology editor for medical informatics projects. Over time it has undergone several rewrites:

  • Protégé-2000 and Protégé-3.x — frame-based editors, Java, widespread in the 2000s
  • Protégé 4.x — rewrite focused on OWL (Web Ontology Language, W3C Recommendation 2004), architecture based on OWL API
  • Current version 4.3 (released during 2013) with full support for OWL 2 (W3C Recommendation 2009)

The licence is BSD. The code is hosted on Stanford’s public repository and GitHub, with contributions from the global community.

The desktop interface is a Java (Swing) GUI exposing classes, properties (object properties and data properties) and instances. It is complemented by WebProtégé, a collaborative web version that allows simultaneous multi-user editing with version control; it was released in 2010 and has been improved continuously since.

OWL and reasoners

OWL 2 — the current version of Web Ontology Language — allows expressing axioms such as:

  • Subsumption (SubClassOf) — Type 2 diabetes mellitus is a subclass of Diabetes mellitus
  • Equivalence (EquivalentClasses) — formal definitions via restrictions
  • Disjointness (DisjointClasses) — Male and Female disjoint
  • Property restrictions — “Pneumonia is an infection involving the lung” expressed as Pneumonia ⊑ Infection ⊓ ∃involves.Lung
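To make the axiom types above concrete, the sketch below renders three of them as strings in OWL 2 functional-style syntax. The helper functions are hypothetical conveniences written for this example, not part of any library, and the class names are assumed to live in a default namespace with the ":" prefix.

```python
def cls(name):
    """Abbreviated IRI of a named class in the default (":") namespace."""
    return f":{name}"


def some(prop, filler):
    """Existential restriction, written ∃prop.filler in DL notation."""
    return f"ObjectSomeValuesFrom(:{prop} {filler})"


def intersection(*operands):
    """Conjunction of class expressions (⊓)."""
    return "ObjectIntersectionOf(" + " ".join(operands) + ")"


def sub_class_of(sub, sup):
    """Subsumption axiom (⊑)."""
    return f"SubClassOf({sub} {sup})"


def disjoint_classes(*classes):
    """Pairwise disjointness axiom."""
    return "DisjointClasses(" + " ".join(classes) + ")"


AXIOMS = [
    sub_class_of(cls("Type2DiabetesMellitus"), cls("DiabetesMellitus")),
    disjoint_classes(cls("Male"), cls("Female")),
    sub_class_of(
        cls("Pneumonia"),
        intersection(cls("Infection"), some("involves", cls("Lung"))),
    ),
]

if __name__ == "__main__":
    for axiom in AXIOMS:
        print(axiom)
```

In practice these axioms are authored in Protégé or generated through an OWL library; building strings by hand is only meant to show how the DL notation maps onto the serialised form.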

A reasoner is an engine that, given an ontology, computes:

  • Taxonomic classification — who is a subclass of whom, recognising implicit relations
  • Consistency checking — verifying that the ontology contains no contradictions
  • Instance classification — given an instance, determining the classes it belongs to
  • Query answering — answering SPARQL queries or DL-queries
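The first three tasks can be illustrated on a toy knowledge base. This is a minimal sketch that handles only named classes, subclass links and disjointness; real reasoners such as those listed below support far richer axioms. All class and individual names are invented for the example.

```python
# Toy knowledge base; every name here is illustrative.
SUBCLASS_OF = {
    "Type2Diabetes": {"DiabetesMellitus"},
    "DiabetesMellitus": {"Disease"},
    "Male": {"Person"},
    "Female": {"Person"},
}
DISJOINT = [("Male", "Female")]
ASSERTED_TYPES = {
    "patient1": {"Type2Diabetes"},
    "patient2": {"Male", "Female"},  # deliberately contradictory
}


def superclasses(cls):
    """Taxonomic classification: all classes subsuming `cls`, itself included."""
    result, stack = {cls}, [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def inferred_types(individual):
    """Instance classification: asserted types plus all their super-classes."""
    types = set()
    for cls in ASSERTED_TYPES[individual]:
        types |= superclasses(cls)
    return types


def inconsistent_individuals():
    """Consistency checking: individuals that fall into two disjoint classes."""
    bad = []
    for individual in sorted(ASSERTED_TYPES):
        types = inferred_types(individual)
        if any(a in types and b in types for a, b in DISJOINT):
            bad.append(individual)
    return bad


if __name__ == "__main__":
    print(sorted(inferred_types("patient1")))
    print(inconsistent_individuals())
```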

Reasoners available for Protégé as of 2013:

  • HermiT — Oxford University (Ian Horrocks group), full OWL 2 DL support
  • Pellet — Complexible (formerly Clark & Parsia), OWL 2 DL, AGPL licence
  • FaCT++ — University of Manchester, historical, C++
  • ELK — Oxford + Ulm, specialised for the OWL 2 EL profile (tractable subset), extremely fast, used for SNOMED CT (in EL++)

Choosing a reasoner is a trade-off between expressiveness (full OWL 2 DL) and performance (tractable OWL 2 EL): in biomedicine, where ontologies are large but expressive restrictions are limited, ELK is often the efficient choice.
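Why the EL profile is tractable can be seen from the shape of its inference procedure: a small set of rules is applied until nothing new is derived. The sketch below is a naive fixpoint over the textbook EL completion rules on normalised axioms. It is not ELK's actual algorithm, which is a far more optimised consequence-based procedure, and all class and property names (including the helper class InvolvesLung) are invented for the example.

```python
# Normalised EL axioms; class and property names are illustrative.
ATOMIC = [("Pneumonia", "Infection")]                               # A ⊑ B
CONJUNCTIONS = [(("Infection", "InvolvesLung"), "LungInfection")]   # A1 ⊓ A2 ⊑ B
EXISTS_RIGHT = [("Pneumonia", "involves", "Lung")]                  # A ⊑ ∃r.B
EXISTS_LEFT = [("involves", "Lung", "InvolvesLung")]                # ∃r.A ⊑ B


def class_names():
    """Collect every class name mentioned in the axioms."""
    names = set()
    for sub, sup in ATOMIC:
        names |= {sub, sup}
    for (c1, c2), sup in CONJUNCTIONS:
        names |= {c1, c2, sup}
    for sub, _, filler in EXISTS_RIGHT:
        names |= {sub, filler}
    for _, filler, sup in EXISTS_LEFT:
        names |= {filler, sup}
    return names


def saturate():
    """Apply the completion rules until no new subsumption is derived."""
    subsumers = {name: {name} for name in class_names()}
    links = set()  # (A, r, B) records the derived fact A ⊑ ∃r.B
    changed = True
    while changed:
        changed = False
        for a, known in subsumers.items():
            derived = set()
            for sub, sup in ATOMIC:
                if sub in known:
                    derived.add(sup)
            for (c1, c2), sup in CONJUNCTIONS:
                if c1 in known and c2 in known:
                    derived.add(sup)
            for sub, role, filler in EXISTS_RIGHT:
                if sub in known and (a, role, filler) not in links:
                    links.add((a, role, filler))
                    changed = True
            for source, role, target in list(links):
                if source != a:
                    continue
                for role2, filler, sup in EXISTS_LEFT:
                    if role2 == role and filler in subsumers[target]:
                        derived.add(sup)
            if not derived <= known:
                known |= derived
                changed = True
    return subsumers


if __name__ == "__main__":
    print(sorted(saturate()["Pneumonia"]))
```

Here the last two axioms together encode the definition "a lung infection is an infection that involves the lung", and saturation derives that Pneumonia is a LungInfection although this was never asserted. Each rule only adds entries to finite sets, so the loop terminates after a polynomial number of additions; this is the property that makes EL reasoning scale to ontologies the size of SNOMED CT.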

OBO Foundry

The biomedical ontology ecosystem is coordinated by the Open Biological and Biomedical Ontologies Foundry (OBO Foundry), founded in 2007 by a consortium of biomedical ontology developers (Barry Smith, University at Buffalo; Suzanna Lewis, Berkeley; Michael Ashburner, Cambridge; Michael Bada, Colorado). OBO Foundry sets principles of good ontology construction:

  • Open — free licence, accessible documentation
  • Common carrier format (OBO format or OWL)
  • Uniquely identified — canonical IRIs
  • Versioned
  • Minimum maintenance guaranteed
  • Plurality of users
  • Well-documented
  • Defined scope without overlaps

OBO Foundry conformant ontologies are a curated reference catalogue, currently about 130 ontologies. Some of the most relevant:

  • Gene Ontology (GO) — Gene Ontology Consortium, one of the earliest and most used, describes biological processes, molecular functions, cellular components
  • ChEBI (Chemical Entities of Biological Interest) — EBI, biologically relevant chemical entities
  • Foundational Model of Anatomy (FMA) — University of Washington (Cornelius Rosse), detailed human anatomy
  • Human Phenotype Ontology (HPO) — Charité Berlin (Peter Robinson), human phenotypes with disease associations
  • Disease Ontology (DO) — University of Maryland School of Medicine (originally Northwestern University), human diseases
  • Uberon — cross-species comparative anatomy
  • Protein Ontology (PRO), Ontology for Biomedical Investigations (OBI), and many others

SNOMED CT, the dominant medical terminology, can be expressed in OWL 2 EL and classified with the ELK reasoner. Protégé can load and explore SNOMED CT but is not its authoring environment: SNOMED CT is edited with dedicated tools by its international curators.

BioPortal

The National Center for Biomedical Ontology (NCBO), an NIH-funded consortium based at Stanford, runs BioPortal (bioportal.bioontology.org), a web repository of biomedical ontologies. As of 2013 BioPortal hosts over 500 uploaded ontologies and offers concept-based search, visualisation, REST APIs for programmatic access and mapping services between ontologies.

BioPortal is fundamental as a sharing infrastructure: a researcher doesn’t have to download an ontology and install an editor or reasoner to explore it; they consult it via browser or API. It also offers automatic annotation: given a text, it returns the concepts from registered ontologies that the text mentions, which complements MetaMap in some scenarios.
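Programmatic access amounts to building an HTTP request against the REST service. The sketch below only constructs the request URLs and sends nothing. The base URL, the endpoint paths (`/search`, `/annotator`), the parameter names (`q`, `text`, `ontologies`, `apikey`) and the ontology acronym `HP` are assumptions to be checked against the current BioPortal API documentation; the API key is a placeholder.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBO REST service; verify against the BioPortal docs.
API_BASE = "https://data.bioontology.org"


def search_url(query, ontologies=(), api_key="YOUR_API_KEY"):
    """Build a concept-search URL, optionally restricted to some ontologies."""
    params = {"q": query, "apikey": api_key}
    if ontologies:
        params["ontologies"] = ",".join(ontologies)
    return f"{API_BASE}/search?{urlencode(params)}"


def annotator_url(text, ontologies=(), api_key="YOUR_API_KEY"):
    """Build an annotation URL for a free-text snippet."""
    params = {"text": text, "apikey": api_key}
    if ontologies:
        params["ontologies"] = ",".join(ontologies)
    return f"{API_BASE}/annotator?{urlencode(params)}"


if __name__ == "__main__":
    print(search_url("brachydactyly", ["HP"]))
    print(annotator_url("patient with short fingers", ["HP"]))
```

The resulting URL can be fetched with any HTTP client; the response format and field names should be taken from the service documentation rather than assumed.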

Clinical applications

Biomedical ontologies find clinical application in several scenarios:

Phenotype matching

A patient with a combination of symptoms is matched — via the Human Phenotype Ontology — against profiles of rare genetic diseases. Tools like Phenomizer (developed at Charité) match phenotypes to candidate diagnoses, useful in rare disease diagnosis where the clinician cannot know all ~7,000 known conditions.
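The matching idea can be sketched with a deliberately simple similarity measure: expand both the patient's phenotypes and each disease profile with their ontology ancestors, then compare the two sets with a Jaccard index. This is not Phenomizer's method, which relies on information-content-based semantic similarity; the terms, hierarchy and disease profiles below are invented for illustration.

```python
# Toy phenotype hierarchy and disease profiles; all names are illustrative.
PARENTS = {
    "Brachydactyly": {"Finger anomaly"},
    "Syndactyly": {"Finger anomaly"},
    "Finger anomaly": {"Limb anomaly"},
    "Short stature": {"Growth abnormality"},
}
DISEASE_PROFILES = {
    "Disease A": {"Brachydactyly", "Short stature"},
    "Disease B": {"Syndactyly"},
    "Disease C": {"Short stature"},
}


def closure(terms):
    """The given terms plus all of their ancestors in the hierarchy."""
    result, stack = set(terms), list(terms)
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def similarity(patient_terms, disease_terms):
    """Jaccard index over the ancestor-expanded term sets."""
    a, b = closure(patient_terms), closure(disease_terms)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank(patient_terms):
    """Candidate diseases sorted from most to least similar."""
    scores = {
        disease: similarity(patient_terms, profile)
        for disease, profile in DISEASE_PROFILES.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for disease, score in rank({"Brachydactyly"}):
        print(f"{disease}: {score:.2f}")
```

Expanding to ancestors is what lets a patient with brachydactyly partially match a disease annotated with syndactyly: the two terms differ, but they share the more general finger and limb anomaly classes.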

Decision support on guidelines

Clinical guidelines — coded in GLIF, PROforma, or OpenCDS knowledge artifacts — rely on ontologies to define concepts precisely (“chronic kidney disease stage 4”, “ACE inhibitor”).

Literature-based discovery

Navigating relations among entities in ontologies and articles indexed with ontologies allows identifying latent connections (gene → protein → disease → drug).
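Finding such a chain is a path search over a graph of typed entities. A minimal sketch using breadth-first search, with entirely hypothetical node identifiers and edges:

```python
from collections import deque

# Hypothetical knowledge graph; identifiers and edges are invented.
EDGES = {
    "gene:G1": ["protein:P1", "protein:P2"],
    "protein:P1": ["disease:D1"],
    "protein:P2": [],
    "disease:D1": ["drug:X1"],
    "drug:X1": [],
}


def shortest_path(start, goal, edges=EDGES):
    """Breadth-first search; returns the shortest chain of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in edges.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


if __name__ == "__main__":
    print(shortest_path("gene:G1", "drug:X1"))
```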

Drug repurposing

Integration between pharmacological resources (the ChEBI ontology, the DrugBank database) and phenotypic/disease ontologies may suggest reuse of existing drugs for new conditions.

Cross-source data integration

Ontologies serve as an integration schema for datasets from different labs, records and registries. They are the technical basis of biomedical linked open data projects.

Ontologies also make search conceptual rather than string-based: a user searches for “inflammatory diseases” and the system expands the query with the known subclasses (rheumatoid arthritis, lupus, ulcerative colitis, …) thanks to the ontology.
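Query expansion of this kind can be sketched as a walk down the subclass hierarchy followed by an ordinary text match. The hierarchy and documents below are toy data; the disease names are the ones used in the example above.

```python
# Toy subclass hierarchy (parent -> direct subclasses); illustrative only.
SUBCLASSES = {
    "inflammatory disease": ["rheumatoid arthritis", "lupus", "ulcerative colitis"],
}


def expand_query(term, subclasses=SUBCLASSES):
    """Return the term followed by all of its descendants, without duplicates."""
    expanded = [term]
    index = 0
    while index < len(expanded):
        for child in subclasses.get(expanded[index], ()):
            if child not in expanded:
                expanded.append(child)
        index += 1
    return expanded


def search(documents, term):
    """Return documents mentioning the term or any of its subclasses."""
    terms = expand_query(term)
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


if __name__ == "__main__":
    docs = [
        "Trial of a new therapy for rheumatoid arthritis",
        "Hip fracture outcomes in elderly patients",
        "Lupus flare management",
    ]
    print(search(docs, "inflammatory disease"))
```

A plain string search for "inflammatory disease" would match none of these documents; the expanded query retrieves the two that mention a subclass.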

Italian community

In Italy, as of 2013, the use of Protégé and biomedical ontologies is consolidated in medical informatics research groups (University of Pavia, Turin, Rome La Sapienza, Politecnico di Milano, University of Bari), in pharmacogenomics and rare disease projects. Italian participation in OBO Foundry is present (OBI has Italian contributions, some specialist ontologies have been produced in Italy).

Operational use in Italian clinical records is marginal; biomedical ontologies remain a research tool rather than current clinical practice. Health linked data is an emerging topic that is likely to receive more attention in the coming years, as the regional FSE (Fascicolo Sanitario Elettronico, the electronic health record) evolves and FHIR arrives, and it should open new application spaces.

Outlook

Directions likely to develop in the coming years include:

  • Protégé 5.x expected in the coming years, with further improvements to WebProtégé and integration with modern collaboration tools
  • Scale-specialised reasoners — SNOMED CT, NCIt (NCI Thesaurus) require optimisations that ELK and derivatives will keep improving
  • FHIR + ontology integration — FHIR has CodeSystem, ValueSet, ConceptMap that connect naturally with ontologies; bidirectional translation is a working area
  • Clinical Knowledge Graphs — integrated ecosystems of ontologies + instances + reasoner + APIs, leaning on triple stores like Virtuoso, Apache Jena TDB, Stardog
  • Embedding and deep learning — neural techniques for representing knowledge (graph embeddings, knowledge graph embeddings) will begin to complement purely logical representation

Biomedical ontologies will continue to be a silent infrastructure of medical informatics — present in many tools without being visible to the end user, fundamental to the semantic quality of what the user sees.


References: Protégé 4.3, Stanford University (protege.stanford.edu). Mark A. Musen, Stanford Center for Biomedical Informatics Research. OWL 2 W3C Recommendation (2009). Reasoners HermiT, Pellet, FaCT++, ELK. OBO Foundry (obofoundry.org), founded 2007. BioPortal (bioportal.bioontology.org), NCBO.
