Aristotle and How Best to Define Things

The great challenge in automating analysis of biomedical data is the fact that people use different words for the same thing and the same word for different things. Having clear, unambiguous definitions, or semantics, is of course critical. I wrote a bit about this in a post on the Interoperability Problem. In today's post, I discuss how best to define things. It turns out Aristotle figured this out long ago. But first I discuss the context in which these definitions matter.

Also in a previous post, I explored using BRIDG as a computable ontology. Since then, I have continued to work on various projects (both within and outside of PhUSE) to represent study information using RDF. I keep running up against the same limitation: the need for a computable study ontology based on clear, unambiguous, computable definitions. For example, in the PhUSE project to represent eligibility criteria using RDF, the criteria need to link to a subject's screening data collected in a study. Where does one link to?

It turns out that the Open Biomedical Ontologies (OBO) provides an Ontology for Biomedical Investigations (OBI). I'm in the process of evaluating this ontology to see if it meets our needs. I think it holds great promise. One advantage is that the OBI, like the other ontologies that make up the OBO, is based on a single, common reference ontology called the BFO (Basic Formal Ontology). This helps establish interoperability of biomedical data from various sources that are expressed using any of the ontologies in the OBO. I'm in the middle of reading a book on the BFO and I will hopefully have more to say about it in future posts.

In the meantime, to continue our work, we have built a mini human study ontology containing only the classes and relationships needed to support these projects. For this I have turned to BRIDG to use existing classes wherever possible. Unfortunately, I keep running into problems with the way many BRIDG classes are defined. I find the current BRIDG definitions don't easily lend themselves to an unambiguous OWL representation. We have discussed this in previous BRIDG working group meetings. It's clear to me that many BRIDG definitions need to be refined before a useful OWL representation can be developed.

Let me explain with some examples. But before I do so: I ran across an interesting chapter in the BFO book on good practices in developing ontologies. One of them is the principle of applying "Aristotelian definitions" to concepts. I had never heard the phrase before, so here is what I understand. An Aristotelian definition is one that has the following form:

S = def. a G that D's.

Where G (for genus) is the immediate parent term of S (for species) in the ontology and D stands for differentia, which is to say D describes what it is about G that makes it an S. Ideally, the differentia itself is described in terms that are already defined in the ontology.

Consider Aristotle's own definition of Human, as described in the BFO book:
human (S) = def. an animal (G) that is rational (D).

As the BFO book points out, "...following this Aristotelian definitional structure ensures that the set of definitions in an ontology precisely mirror the hierarchy of greater and lesser generality among its universals."
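
To make the quoted point concrete, here is a minimal Python sketch (not from the BFO book) of definitions stored in genus/differentia form. The `Definition` class, the `ancestors` helper, and the two-term ontology are all hypothetical names made up for illustration. The idea it tries to show is that when every definition names its genus, the class hierarchy can be read straight off the definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Definition:
    species: str           # S: the term being defined
    genus: Optional[str]   # G: immediate parent term (None for a root term)
    differentia: str       # D: what sets S apart from other kinds of G

    def text(self) -> str:
        if self.genus is None:
            return f"{self.species} = root term (no genus)"
        article = "an" if self.genus[0].lower() in "aeiou" else "a"
        return f"{self.species} = def. {article} {self.genus} that {self.differentia}"


# Hypothetical two-term ontology, for illustration only.
DEFINITIONS: Dict[str, Definition] = {
    "Animal": Definition("Animal", None, ""),
    "Human": Definition("Human", "Animal", "is rational"),
}


def ancestors(term: str, definitions: Dict[str, Definition]) -> List[str]:
    """Walk genus links upward; the definitions alone yield the hierarchy."""
    chain = []
    genus = definitions[term].genus
    while genus is not None:
        chain.append(genus)
        genus = definitions[genus].genus
    return chain


print(DEFINITIONS["Human"].text())
print(ancestors("Human", DEFINITIONS))
```

No separate hierarchy table is needed in this sketch: the parent of each term is whatever its definition names as the genus.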

Now let's take a look at BRIDG definitions relevant to a Study Subject. The latest version, 4.0, documents the following:

BiologicEntity: Any individual living (or previously living) being.

Person: A human being.

Subject: An entity of interest, either biological or otherwise.

StudySubject: A physical entity which is the primary unit of operational and/or administrative interest in a study.

For the purposes of discussion, I prefix a name with a colon (":") to denote a resource in RDF (e.g. an OWL class or property).

It's clear to me that :Person is a :subClassOf :BiologicEntity. If the :BiologicEntity has a property :species with a value :HomoSapiens then a reasoning engine can conclude that the :BiologicEntity is also a :Person. I see no problem there.  

It's also clear a BRIDG Subject may or may not be a Biologic Entity. But what does it mean for a Subject to be an entity "of interest?" Is a Person that is being considered for recruitment into a trial a Subject? If so, one can define an activity called :RecruitmentActivity and a property called :undergoes and then create an Aristotelian definition that a :Subject is an :Entity that :undergoes a :RecruitmentActivity. This makes sense for a Human Subject but not so much for animals or non-biologic things. How does one resolve this?

One way to sidestep the issue is to define a subclass of :Subject called :HumanSubject and create the relationships above only for the HumanSubject class. So the Aristotelian definition becomes:

A :HumanSubject is a :Person that undergoes a :RecruitmentActivity. 

So then how is a :HumanSubject related to a :StudySubject? It's not clear, since StudySubjects can be non-human. One way to resolve this is to create a subclass of :StudySubject called :HumanStudySubject, and then one can say that a :HumanStudySubject is also a subclass of :HumanSubject. More specifically, one can define a property called :participatesIn and say that a :HumanStudySubject is any :HumanSubject that :participatesIn any :Study. (:participatesIn can be defined as a subproperty of :undergoes, i.e. :undergoes any protocol-specified activity, including informed consent or screening.)

So for the purposes of our mini-ontology, we have the following computable, Aristotelian definitions. I'd appreciate feedback on these. 

BiologicEntity: Any individual living (or previously living) Entity.
(i.e. an :Entity (G) that is living or was previously living (D))

Person: A human being. A BiologicEntity that has species homo sapiens.

Subject: An Entity, either biologic or otherwise, of interest for investigation in a Study.

StudySubject: An Entity which is the primary unit of operational and/or administrative interest in a study. The StudySubject undergoes (is subjected to) Study-specified activities as described in the StudyProtocol.
(i.e. a :Subject (G) that undergoes/is subjected to Study-specified activities (D) as described in the Study Protocol).

HumanStudySubject: A StudySubject who is also a Person.

And then it can follow that: 

A HumanStudySubject who satisfies all Study-specific Eligibility Criteria.


Determining Study Eligibility

There is great interest in automating processes to determine eligibility in a study. There are two major use cases that I'm aware of. The first is to assess protocol feasibility and facilitate recruitment into trials. In this case, one would like to query multiple EHR sources to identify potentially eligible subjects. The second use case is during study conduct. During the screening phase, data are collected and used to perform an eligibility analysis. The eligibility rules form the analysis plan for the eligibility analysis, the result of which is a binary outcome: true (eligible) or false (not eligible).  Those who pass continue on to experimental treatment. Automating this analysis would facilitate enrollment and decrease the risk of both false positive and false negative enrollment errors.

Modeling eligibility criteria using RDF is the subject of a project within the PhUSE Semantic Technologies Working Group. I encourage interested individuals to participate in those regular calls to help refine and test our approach. We could use more help. In a previous post, I discussed how the owl:Restriction class can be used to determine eligibility, but it turns out this supports only the simplest criteria. OWL 2 introduces property chains, and we were able to model a few more criteria, but not enough. Here I describe a much more powerful strategy: using SPARQL Inference Notation (SPIN). I credit my colleague Ron Katriel for suggesting we use SPIN. I am now busily soaking up everything I can learn about SPIN. It is quite powerful.

The diagram below illustrates the classic subclass relationship between a Subject and an EligibleSubject. The challenge before the working group is: how does one automate determining which members of the superclass Subject belong in the subclass EligibleSubject using RDF?

Without getting into the details of SPIN, it is possible to write eligibility rules using SPARQL, the query language of the semantic web, and embed them into RDF triples so that the computable eligibility rules exist with the data and can be used to identify eligible subjects.

In the PhUSE working group, we are using an imaginary study, ABCD. It has the following eligibility criteria:
Subject must be female.
Subject must have Diabetes Mellitus for at least 2 years.
Subject must have a positive Hepatitis B Surface Antigen (HBsAg) test.
Subject must have a negative RPR test for syphilis.
Subject must have a systolic blood pressure of less than or equal to 140 mmHg.

First we created a simple study ontology to relate subjects with the study and its activities (in this case subject observations). Then we populated it with dummy data for 5 subjects. Then we wrote SPARQL queries for each criterion. This allows us to define a subClass of subjects that meet a criterion ("EligibilityCriterionSubject").
The query to define the subclass of Subjects that have systolic BP <= 140 mmHg looks like this:

    # Prefix declaration for ec2: omitted
    CONSTRUCT {
        ?this a ec2:SBPLTE140Subject .
    }
    WHERE {
        ?this a ec2:Subject_ABCD .
        ?q a ec2:SystolicBP .
        ?r a ec2:SystolicBPOutcome .
        ?this ec2:undergoes ?q .
        ?q ec2:hasOutcome ?r .
        ?r ec2:hasValue ?v .
        FILTER (?v <= 140) .
    }
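
For readers who don't yet read SPARQL, here is a rough Python sketch of what the systolic BP rule does. It is not how SPIN or TopBraid evaluates the rule. The triples are hypothetical dummy data with invented values, the `ec2:` prefix is dropped for brevity, and the helper names `objects` and `sbp_lte_subjects` are my own. The sketch follows the same chain as the query: a subject undergoes a SystolicBP activity, which has an outcome, which has a value to compare against 140.

```python
# Hypothetical dummy triples (subject, predicate, object); values are invented.
TRIPLES = {
    ("Subject0001", "a", "Subject_ABCD"),
    ("Subject0001", "undergoes", "SBP_1"),
    ("SBP_1", "a", "SystolicBP"),
    ("SBP_1", "hasOutcome", "SBPOutcome_1"),
    ("SBPOutcome_1", "a", "SystolicBPOutcome"),
    ("SBPOutcome_1", "hasValue", 132),
    ("Subject0002", "a", "Subject_ABCD"),
    ("Subject0002", "undergoes", "SBP_2"),
    ("SBP_2", "a", "SystolicBP"),
    ("SBP_2", "hasOutcome", "SBPOutcome_2"),
    ("SBPOutcome_2", "a", "SystolicBPOutcome"),
    ("SBPOutcome_2", "hasValue", 150),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def sbp_lte_subjects(triples, limit=140):
    """Subjects of Study ABCD with a systolic BP outcome value <= limit."""
    matched = set()
    subjects = [s for (s, p, o) in triples if p == "a" and o == "Subject_ABCD"]
    for this in subjects:
        for q in objects(triples, this, "undergoes"):
            if "SystolicBP" not in objects(triples, q, "a"):
                continue
            for r in objects(triples, q, "hasOutcome"):
                if "SystolicBPOutcome" not in objects(triples, r, "a"):
                    continue
                if any(v <= limit for v in objects(triples, r, "hasValue")):
                    matched.add(this)
    return matched


print(sbp_lte_subjects(TRIPLES))
```

With this dummy data only Subject0001 (value 132) matches; Subject0002 (value 150) is filtered out.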

The query to define the subClass of Subjects with Diabetes Mellitus for at least 2 years looks like this:

    # Prefix declarations (ec2:, time:, smf:, xsd:) omitted
    CONSTRUCT {
        ?this a ec2:DiabetesMellitus2YRSubject .
    }
    WHERE {
        ?this a ec2:Subject_ABCD .
        ?q a ec2:DiabetesMellitus .
        ?q ec2:medConditionInterval ?r .
        ?r time:hasBeginning ?s .
        ?s time:inXSDDateTime ?dmstart .
        ?this ec2:hasPersonMedicalCondition ?q .
        ?this ec2:studyInterval ?t .
        ?t time:hasBeginning ?u .
        ?u time:inXSDDateTime ?sstart .
        BIND (smf:duration("y", ?dmstart, ?sstart) AS ?dmdur) .
        FILTER (xsd:integer(?dmdur) >= 2) .
    }
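
The last two lines of the diabetes query compute how long the subject has had the condition at study start and keep those with at least 2 years. Below is a small Python sketch of that arithmetic. It assumes the duration is counted in completed whole years, which may not be exactly how smf:duration rounds. The function names and dates are hypothetical.

```python
from datetime import date


def whole_years_between(start: date, end: date) -> int:
    """Completed years from start to end (an assumed rounding rule)."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet been reached in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def has_condition_at_least(dm_start: date, study_start: date, min_years: int = 2) -> bool:
    """True if the condition began at least min_years before study start."""
    return whole_years_between(dm_start, study_start) >= min_years


# Hypothetical dates: diagnosed 15 June 2010, study starts 1 March 2013.
print(has_condition_at_least(date(2010, 6, 15), date(2013, 3, 1)))
```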


Each query is embedded in an RDF triple of the form:

:Subject_ABCD spin:rule "query" . 

The class of EligibleSubjects for Study ABCD is defined as the intersection (owl:intersectionOf) of the individual EligibilityCriterionSubject classes, one for each criterion.

When you run inferences using a standard semantic web tool (we are using TopBraid Composer), the system correctly identifies 3 subjects with a positive HepB antigen test, 3 subjects with Diabetes Mellitus (two of whom have had the diagnosis for at least 2 years), and so forth. The intersection of all these classes contains only one subject (Subject0001), who is a member of all the individual EligibilityCriterionSubject subclasses and is therefore a member of the EligibleSubject class for that study.
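
The intersection step is ordinary set intersection. Here is a Python sketch with hypothetical class memberships, invented so that one subject ends up in every class; they are not the working group's actual dummy data. Apart from DiabetesMellitus2YRSubject and SBPLTE140Subject, the class names are also made up.

```python
# Hypothetical membership of each criterion class (invented for illustration).
CRITERION_CLASSES = {
    "FemaleSubject": {"Subject0001", "Subject0002", "Subject0004"},
    "DiabetesMellitus2YRSubject": {"Subject0001", "Subject0003"},
    "HBsAgPositiveSubject": {"Subject0001", "Subject0002", "Subject0005"},
    "RPRNegativeSubject": {"Subject0001", "Subject0003", "Subject0004", "Subject0005"},
    "SBPLTE140Subject": {"Subject0001", "Subject0002", "Subject0003", "Subject0004"},
}


def eligible_subjects(criterion_classes):
    """Members common to every criterion class (the class intersection)."""
    return set.intersection(*criterion_classes.values())


print(eligible_subjects(CRITERION_CLASSES))
```

A subject missing from even one criterion class drops out, which is why only Subject0001 survives in this made-up data.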

Since the query is expressed in SPARQL, the ability to perform a distributed query across multiple data sources is straightforward, at least in theory. Of course the data being queried should be in RDF, although I understand there is a way to convert SPARQL to SQL to query local databases. 

We're still working on the ontology for this imaginary study, and plan to add more complex eligibility criteria in the future. For example, we want to add additional timing variables to support rules such as: Subject must have a Systolic BP <=140 measured during the two week screening period.

Please email me or leave a comment below if you wish to see the most current working copy of the study ontology/knowledgebase. It is evolving quickly. 


Managing Study Workflow using RDF

In a previous post, I discussed the owl:Restriction class and how it can be used to define and identify eligible subjects in a study. Here I illustrate other uses of this class to help manage the study workflow.

A computable (machine-readable and interpretable) study protocol is fundamental to automating processes to collect and analyze study data. A computable protocol provides sufficient detail for information systems to determine what is supposed to happen and help identify gaps in protocol execution. It would enhance protocol compliance and enable automation of analyses that look for protocol violations and their impact on the objectives.

At its core, a study protocol is a collection of activities to be performed on study subjects. It describes what data are collected, and the rules that determine when those activities occur. Current standards fall short in representing this workflow in a computable format. I explore here a possible solution using RDF and OWL. I must emphasize I am not an RDF expert so the discussion is by its nature at a high level. The details are my best attempt to describe how it might work. I hope that experts in semantic web technologies will read and comment on this proposed solution, and weigh in on the possible benefits of this approach. You will see slightly different uses of the owl:Restriction class than those I discussed previously.

The proposed approach is straightforward. If you think of a protocol as a collection of activities, each activity has a rule determining when it can begin. The system must be able to identify which activity is "next" and inform the site what to do next. The system must also have the ability to document digitally when an activity ends (i.e. when the activity is considered complete). Each activity has a unique ID so that it can reference other activities. In RDF this is the URI (uniform resource identifier). So a start rule can say Start B when A is complete and StudyDay is 12-17. There will always be one activity with no start rule, so that's the one the system presents first for execution. As activities are marked complete, the next activity is enabled for execution, and so on. Now for each subject, you throw all of these activities and their associated rules in a bowl and pick out individual activities until you find which one you do next. Computers are great at doing this sort of thing quickly. You repeat until all activities that were supposed to be done are completed. Then the study is over for that subject. The grouping of activities into arms, epochs, elements, etc. for any given subject for analysis purposes can all be derived.
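
Before getting to RDF, the "pick the next activity" loop can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Activity` class, its `after` and `day_window` fields, and the `ex:` identifiers are hypothetical, and a start rule is reduced to "these activities are complete and the study day falls in this window".

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Activity:
    uri: str                                       # unique ID of the activity
    after: FrozenSet[str] = frozenset()            # activities that must be complete
    day_window: Optional[Tuple[int, int]] = None   # inclusive study-day range


def next_activities(protocol: List[Activity], completed: Set[str], study_day: int) -> List[str]:
    """Activities not yet done whose start rule is satisfied right now."""
    ready = []
    for act in protocol:
        if act.uri in completed:
            continue                      # already done for this subject
        if not act.after <= completed:
            continue                      # a required activity is still open
        if act.day_window and not (act.day_window[0] <= study_day <= act.day_window[1]):
            continue                      # outside the allowed study days
        ready.append(act.uri)
    return ready


# Hypothetical protocol: "Start B when A is complete and StudyDay is 12-17".
PROTOCOL = [
    Activity("ex:A"),
    Activity("ex:B", after=frozenset({"ex:A"}), day_window=(12, 17)),
]

print(next_activities(PROTOCOL, set(), 1))
print(next_activities(PROTOCOL, {"ex:A"}, 11))
print(next_activities(PROTOCOL, {"ex:A"}, 12))
```

Activity A has no start rule, so it is offered first; B appears only once A is complete and the study day reaches 12.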

My idea to use RDF to represent this workflow emerged while reading the book “Semantic Web for the Working Ontologist” (an excellent book, by the way; I highly recommend it). In Chapter 11, on Basic OWL, I came across an example that describes how to model Questions and Answers in OWL (you'll find it on page 222). As I read it, a light went on, although in reality I had to read it several times to fully understand it. I thought, this example is entirely relevant to clinical trials. The exercise describes a group of questions (activities) that are asked generally in sequence, but some questions are asked only when other questions are completed and answered in a certain way. That is, some questions have prerequisites before the system can ask them. So there are concepts for a Question, an Answered Question, and an Enabled Question (ready to be asked by the system). Only when a particular answer is obtained and documented does another question become an Enabled Question.

So how does this apply to clinical trials? Here's how it might work. Let's review a few common study activities.

  • Obtain informed consent
  • Collect screening clinical observations
  • Determine eligibility
  • Allocate to Treatment (Drug or Placebo)
  • Administer Treatment (Drug)
  • Collect a treatment-related clinical observation on Study Day 14.

Most rules can typically be expressed as conditional (if… then…) statements, e.g. IF informed consent is signed THEN conduct screening activities. Some activities do not have start rules. For example, “obtain informed consent” has no start rule. This activity is performed by default at the start of most studies. Some activities depend on the study day or are triggered by the occurrence of another activity. IF StudyDay = 7 THEN perform a complete blood count. IF <headache occurs> THEN <administer study drug>. Some activities are unplanned and also have no start rules (e.g. unplanned clinical observations in response to an unexpected change in the patient’s medical condition). These are added as the study progresses. Activities can also be grouped and nested. Rules to determine eligibility are in effect rules on whether to start the Allocation activity, since only eligible subjects are allocated to experimental treatment.

The basic structure of study activities can be represented by classes and properties in OWL. The basic schema for the study activities is as follows. Throughout this example I use the namespace s: to refer to elements (classes, properties) that relate to an ontology of a study, and the namespace t: to refer to the elements of the particular example study used here (i.e. instance data).  A particular study will have Study Activities and Study Activity Outcomes.  A StudyActivity is considered completed when an Outcome is documented in the system. An outcome could be "unable to complete" if the subject is lost to follow-up, for example. 

First, we establish the StudyActivity and StudyActivityOutcome as OWL classes.

s:StudyActivity a owl:Class.
s:StudyActivityOutcome a owl:Class.

Typical activities and possible outcomes are the following:

Activity                      ActivityOutcome
Informed Consent              InformedConsentSigned
Screening                     Sub-activities completed
     Hemoglobin               14.0 g/dL
     Hematocrit               42%
     Platelet Count           300K/mm3
     Fasting Blood Glucose    102 mg/dL
     Eligibility Assessment   True (i.e. Eligible Subject)
Randomization                 Placebo
Substance Administration      Placebo once daily for 3 months

Before a StudyActivity is performed, the Outcome might be one of several outcome options. We therefore define a class called StudyActivityOutcomeOption as a subclass of StudyActivityOutcome. We define a property called s:hasOption as follows. It describes the list of possible outcomes relevant to that activity. For observations, it defines the value set. 

s:hasOption a owl:ObjectProperty;
           rdfs:domain s:StudyActivity;
           rdfs:range s:StudyActivityOutcomeOption.

When a StudyActivity is complete it has a documented outcome, which is one of its outcome options. Another way of saying it is that the selected outcomes are a subset of the outcome options. So we define a class called StudyActivitySelectedOutcome as a subclass of StudyActivityOutcomeOption. We also define a property called s:hasOutcome. If a StudyActivity has a documented outcome, it's considered a CompletedStudyActivity, a subclass of StudyActivity. The RDF looks like this:

s:StudyActivitySelectedOutcome rdfs:subClassOf s:StudyActivityOutcomeOption.
s:CompletedStudyActivity a owl:Class.
s:CompletedStudyActivity rdfs:subClassOf s:StudyActivity.
s:hasOutcome a owl:ObjectProperty;
           rdfs:domain s:CompletedStudyActivity;
           rdfs:range s:StudyActivitySelectedOutcome.

I think s:hasOutcome is a subproperty of s:hasOption. (I defer discussion of subproperties here, but in reviewing the definition of subproperties, I think this is true.)

Because certain activities cannot begin until other activities are complete, we need the concept of a CompletedStudyActivity, which is the subclass of StudyActivity defined above.

One way a study activity is complete (i.e. is a CompletedStudyActivity) is if it has a documented s:StudyActivityOutcome through the s:hasOutcome property. That is, it is any StudyActivity for which the database holds a triple of the form: (a StudyActivity) s:hasOutcome (a StudyActivityOutcomeOption). Given the meaning of rdfs:domain and rdfs:range for the s:hasOutcome property, this makes the study activity a completed study activity and the study activity outcome option a study activity selected outcome.
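
The domain and range inference described here can be mimicked in a few lines of Python. This is only a sketch of the two RDFS rules involved (rdfs2 for domain, rdfs3 for range), not a reasoner; the dictionaries and the `infer_types` function are hypothetical names.

```python
# Domain and range declared for s:hasOutcome in the schema above.
DOMAIN = {"s:hasOutcome": "s:CompletedStudyActivity"}
RANGE = {"s:hasOutcome": "s:StudyActivitySelectedOutcome"}


def infer_types(triples):
    """Apply the RDFS domain (rdfs2) and range (rdfs3) rules once."""
    inferred = set()
    for s, p, o in triples:
        if p in DOMAIN:
            inferred.add((s, "rdf:type", DOMAIN[p]))   # the subject gets the domain class
        if p in RANGE:
            inferred.add((o, "rdf:type", RANGE[p]))    # the object gets the range class
    return inferred


# One asserted triple documenting an outcome.
asserted = {("t:InformedConsent", "s:hasOutcome", "t:InformedConsentSigned")}

for triple in sorted(infer_types(asserted)):
    print(triple)
```

A single asserted triple yields two new type triples, one for the activity and one for the outcome.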

A StudyActivity can have one or more subactivities, which themselves are each a StudyActivity. So we define a class called SubActivity as a subclass of StudyActivity, and we define a property called s:hasSubActivity as follows:

s:hasSubActivity a owl:ObjectProperty;
           rdfs:domain s:StudyActivity;
           rdfs:range s:SubActivity.

Any StudyActivity that appears as the value (object) of the s:hasSubActivity property is automatically a SubActivity.

Another way a StudyActivity is complete is if all of its SubActivities are complete. This can be expressed as a restriction on the values that the s:hasSubActivity property can have: 

s:CompletedStudyActivity owl:equivalentClass [a owl:Restriction;
                 owl:onProperty s:hasSubActivity;
                 owl:allValuesFrom s:CompletedStudyActivity].

I think the use of an owl:allValuesFrom restriction here means that a StudyActivity is a CompletedStudyActivity if all of its SubActivities are CompletedStudyActivities.
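
Here is the intended behavior as a closed-world Python sketch. Note that this is not what an OWL reasoner does: OWL reasons under an open world and cannot assume it has seen every sub-activity. The sketch also makes one deliberate assumption that differs from the axiom above: an activity with no sub-activities and no outcome is treated as not complete. The data and the `is_completed` function are hypothetical.

```python
# Hypothetical instance data for one subject.
SUB_ACTIVITIES = {"t:Screening": ["t:Hemoglobin", "t:Hematocrit"]}
OUTCOMES = {"t:Hematocrit": "42%"}   # outcomes documented so far


def is_completed(activity, sub_activities, outcomes):
    """Complete if an outcome is documented, or if every sub-activity is complete."""
    if activity in outcomes:
        return True
    subs = sub_activities.get(activity, [])
    # Assumption: an activity with no sub-activities needs its own outcome.
    return bool(subs) and all(is_completed(s, sub_activities, outcomes) for s in subs)


print(is_completed("t:Screening", SUB_ACTIVITIES, OUTCOMES))

# Document the remaining sub-activity, then check again.
more_outcomes = dict(OUTCOMES, **{"t:Hemoglobin": "documented"})
print(is_completed("t:Screening", SUB_ACTIVITIES, more_outcomes))
```

Screening is incomplete while the hemoglobin result is missing and becomes complete once it is documented.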

Those activities that cannot begin until other activities are complete are described to have Prerequisites. We therefore define a new property called :hasPrerequisite:

s:hasPrerequisite a owl:ObjectProperty;
            rdfs:domain s:StudyActivity;
            rdfs:range s:StudyActivityOutcomeOption.

It's important to define the prerequisite as an outcome option (and not the activity itself) to allow branching. Depending on the selected outcome, different activities may subsequently be enabled for a particular subject.

Any StudyActivity for which all of its prerequisites have been met (i.e. every prerequisite is a StudyActivitySelectedOutcome) becomes an EnabledStudyActivity. These are the activities that are performed next in a study. EnabledStudyActivity is defined as an OWL restriction class on the property s:hasPrerequisite.

s:EnabledStudyActivity a owl:Class.
s:EnabledStudyActivity rdfs:subClassOf s:StudyActivity.
s:EnabledStudyActivity owl:equivalentClass [a owl:Restriction;
           owl:onProperty s:hasPrerequisite;
           owl:allValuesFrom s:StudyActivitySelectedOutcome].

(note lines 2&3 have been corrected since the original posting to address a reader comment)

I think this means that a StudyActivity is an EnabledStudyActivity if all of its prerequisites are documented StudyActivitySelectedOutcome(s) (i.e. all values of the hasPrerequisite property, wherever it's used, come from the StudyActivitySelectedOutcome class).

So let's look at some instance activities for any given subject in my hypothetical study.

  • Obtain informed consent
  • Collect screening clinical observations
  • Determine eligibility
  • Allocate to Drug Treatment
  • Administer Drug Treatment
  • Collect a treatment-related clinical observation on Study Day 14.

The very first activity, informed consent, has no prerequisites. This can be represented in the system as the following:

t:InformedConsent a s:StudyActivity.
t:InformedConsent a [a owl:Restriction;
           owl:onProperty s:hasPrerequisite;
           owl:cardinality "0"^^xsd:nonNegativeInteger].

It basically says that the InformedConsent activity has no prerequisites. For reasons I don't fully understand, the definition of an EnabledStudyActivity (which is defined using an owl:allValuesFrom restriction) will include the InformedConsent activity because of the way the allValuesFrom restriction is evaluated for empty sets. So the system determines InformedConsent is Enabled and surfaces it in the system for action.

Once the appropriate triple is entered in the database saying that the activity is completed, i.e.,

t:InformedConsentX s:hasOutcome t:InformedConsentSigned.

the system infers it is a CompletedStudyActivity.

Meanwhile, the t:Screening activity has a prerequisite in the system: the t:InformedConsentSigned activity outcome. As soon as the system detects that the informed consent was signed, it infers that t:Screening is an EnabledStudyActivity and presents it to the study team for action.
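
The empty-set behavior is easier to see in code. A universal statement ("all prerequisites are selected outcomes") is trivially true when there are no prerequisites, and Python's built-in `all()` behaves the same way on an empty sequence. In OWL the reasoner must also know that there are no prerequisites, which is what the cardinality 0 restriction on t:InformedConsent states. The sketch below is a closed-world illustration only: the `enabled_activities` function is hypothetical, and leaving out already completed activities is my own addition.

```python
# Prerequisite outcomes per activity, following the example in the text.
PREREQUISITES = {
    "t:InformedConsent": [],                        # no prerequisites
    "t:Screening": ["t:InformedConsentSigned"],
}


def enabled_activities(prerequisites, selected_outcomes, completed):
    """Activities whose prerequisites are all selected outcomes and that are not yet done."""
    return sorted(
        activity
        for activity, prereqs in prerequisites.items()
        if activity not in completed
        and all(p in selected_outcomes for p in prereqs)   # all([]) is True
    )


# Study start: nothing documented yet.
print(enabled_activities(PREREQUISITES, set(), set()))

# After consent is signed and documented.
print(enabled_activities(PREREQUISITES, {"t:InformedConsentSigned"}, {"t:InformedConsent"}))
```

At study start only the informed consent activity is enabled; once its outcome is documented, screening takes its place.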

Now t:Screening has multiple sub-activities as documented in the database using the s:hasSubActivity property. When every sub-activity has documented triples entered in the database in the form :subactivityX s:hasOutcome t:selected-outcomeX, the system infers that t:Screening is a CompletedStudyActivity.

Now here's one piece I haven't yet figured out. The system needs to automatically generate and enter a triple in the database stating that the screening activity hasOutcome a selected outcome. This is necessary to enable the next study activity: determine eligibility. I welcome any ideas on how to do this.

In summary, it seems possible to use OWL to model an executable study workflow. I would appreciate comments from OWL experts on whether this approach would work. I think it would be informative to get this ontology documented and working using dummy instance data. However, I find the logic rather complex and I honestly am not sure if I have it right. It may very well need substantial modification, but the overall strategy makes sense to me. I'd also like to figure out how to weave in study days in the workflow, as many activities depend on the study day. I welcome thoughts on this issue.


Restriction Classes

As I delve deeper into the world of RDF and OWL, I keep coming across a single OWL class that turns out to be critical in enabling computer-assisted reasoning. This class is called owl:Restriction. Here's my understanding of what this class is and why it's so important in enabling computer assisted reasoning across biomedical data. We need to understand it and use it more in the data we collect and analyze.

In the semantic web, everything is grouped into classes or "groups of things" that share common properties. A member of a class is called an individual (or an instance). Classes can have subclasses (or subsets) that share additional properties with each other, but not necessarily with other members of the superclass.  So the Apple class is a subclass of the Fruit class since every Apple is a Fruit (but not vice versa).

So let's define two classes called Automobile and Corolla. Let's also say Corolla is a subclass of Automobile. In RDF this would look like this:

:Automobile rdf:type owl:Class.
:Corolla rdf:type owl:Class.
:Corolla rdfs:subClassOf :Automobile. 

I use Turtle syntax throughout because it's easy for humans to read and understand.

So it turns out my friend Ruth has a Corolla that she calls "Tristan" (she's a big opera lover and names all her cars after Wagnerian characters). So in RDF this would be expressed as:

ruth:Tristan rdf:type :Corolla.

Which simply says there is a thing/resource called Tristan and it's a type of :Corolla; i.e. a member of the :Corolla class. (I'm intentionally avoiding the namespace issue.)

A reasoner can analyze these triples and conclude that Tristan is an Automobile. The following triple (new knowledge) can be inferred:

ruth:Tristan rdf:type :Automobile. 

Now this is rather trivial new knowledge, but bear with me.

The limitation of this example is that if another resource comes along, let's call it Isolde, and someone out there has asserted that Isolde is an Automobile, the computer has no way of reasoning what type of automobile Isolde might be. The computer doesn't understand what property (or properties) of an automobile make it a Corolla, or any other type of automobile.
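
Here is the subclass inference, and its limitation, as a tiny Python sketch. It applies only the RDFS subclass rule (rdfs9): a member of a class is also a member of that class's parent. The `infer_types` function and the way the data is laid out are hypothetical.

```python
SUBCLASS_OF = {":Corolla": ":Automobile"}   # child class -> parent class


def infer_types(asserted, subclass_of):
    """Propagate class membership up the subclass hierarchy."""
    inferred = set(asserted)
    frontier = list(asserted)
    while frontier:
        individual, cls = frontier.pop()
        parent = subclass_of.get(cls)
        if parent is not None and (individual, parent) not in inferred:
            inferred.add((individual, parent))
            frontier.append((individual, parent))
    return inferred


# (individual, class) pairs that someone has asserted.
ASSERTED = {("ruth:Tristan", ":Corolla"), (":Isolde", ":Automobile")}

for pair in sorted(infer_types(ASSERTED, SUBCLASS_OF)):
    print(pair)
```

Tristan is inferred to be an Automobile, but nothing is inferred about Isolde beyond what was asserted: membership flows up the hierarchy, never down.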

How does this apply to clinical trials? Replace "Automobile" with "Subject" and "Corolla" with "EligibleSubject" in my example. From the triples I asserted, I know that Tristan is an EligibleSubject in the trial simply because I said so (in reality, I manually analyzed his screening data and confirmed him to be eligible). But the computer doesn't know why; it just knows that he's an EligibleSubject because someone said so. But when the next Subject, Isolde, comes along, the computer is clueless. Is she eligible too? You'd like the computer to be able to figure it out, right?

So how does owl:Restriction help? owl:Restriction allows us to define new classes based on the properties that individuals in that class share. So let's assume that the study in question is an oral contraceptive study in females. To be considered eligible the Subject must be female. So in RDF we define a property called :sex and we assert:

:Isolde rdf:type :Subject.
:Isolde :sex :Female. 

These triples say Isolde is a Subject and she's female.

But how does the computer know she's eligible for the study? We create a new (nameless) class and "restrict" membership in that class only to females. Then we say only members of that class are EligibleSubjects. Before we see how it looks in RDF, we need to review blank nodes.

A "nameless" resource is allowed in RDF. It's called a blank node or bnode. It basically has no identifier (URI) of its own. In Turtle it is written using square brackets that enclose the properties the blank node has. For example:

[ :sex :Female; 
     :wrote :WutheringHeights]

In plain English, one would say, and a computer would interpret:  "this is a nameless resource who is a female who also wrote Wuthering Heights."

Now let's create a blank node that looks like this:

[a owl:Restriction;
     owl:onProperty :sex;
     owl:hasValue :Female]

In plain English: this is a nameless class whose members all have the value :Female for the property :sex. It basically defines the class of resources that are female.

So now one can assert the following triple in the protocol:

:EligibleSubject owl:equivalentClass [a owl:Restriction;
                                           owl:onProperty :sex;
                                           owl:hasValue :Female].                                                                            

This says only females are eligible subjects.

So now when :Isolde comes along and the computer sees the following triple saying she's a female:

:Isolde :sex :Female. 

the computer can infer the following new triple:

:Isolde rdf:type :EligibleSubject. 
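
The same inference can be written as a short Python sketch. It checks only one direction of the equivalence (anything with the required property value is a member of the class), and it is a plain lookup rather than a reasoner. The `RESTRICTION` dictionary layout, the `infer_members` function, and the extra subject :Tristan with :sex :Male are hypothetical.

```python
# The restriction from the text: EligibleSubject = things whose :sex is :Female.
RESTRICTION = {
    "class": ":EligibleSubject",
    "on_property": ":sex",
    "has_value": ":Female",
}

# Asserted triples; :Tristan is a hypothetical second subject.
TRIPLES = {
    (":Isolde", "rdf:type", ":Subject"),
    (":Isolde", ":sex", ":Female"),
    (":Tristan", "rdf:type", ":Subject"),
    (":Tristan", ":sex", ":Male"),
}


def infer_members(triples, restriction):
    """New rdf:type triples for resources that satisfy the hasValue restriction."""
    return {
        (s, "rdf:type", restriction["class"])
        for (s, p, o) in triples
        if p == restriction["on_property"] and o == restriction["has_value"]
    }


print(infer_members(TRIPLES, RESTRICTION))
```

Only :Isolde is placed in :EligibleSubject, because only she has the required value for :sex.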

We've successfully defined the subclass EligibleSubject based on a property its members all share, and now a computer can identify new members of that class. If the computer had access to many individuals and their properties on the web or in EHR systems, this approach could be used for recruitment. This is a hot topic in clinical trials at the moment.

This same strategy can be used in many settings. Consider these triples:

:Drug rdf:type owl:Class.
:EffectiveDrug rdf:type owl:Class.
:EffectiveDrug rdfs:subClassOf :Drug. 

We're asserting that Drug is a class, and that EffectiveDrug is a class that is a subclass of Drug.

By using owl:Restriction and the associated properties owl:onProperty and owl:hasValue, one has the ability to tell a computer what properties of a drug make it an effective drug. This way computers can help identify effective drugs for us. It's a paradigm shift in how we do efficacy evaluations right now, but owl:Restriction makes it possible.

The possibilities are endless. One can imagine many classes such as PureDrug, ExpiredDrug, InvestigationalDrug, MarketedDrug, and basically define them as owl:Restriction classes, allowing computers to automatically determine which class(es) any given drug belongs to.

In a future post, I'll discuss how owl:Restriction can be used to manage study workflow; specifically how it can manage the sequence of study activities as described in the protocol, including support for branching and adaptive designs.


The Interoperability Problem

I've been reading quite a bit about interoperability, or specifically computable semantic interoperability (CSI), and why it's so difficult to achieve in health care. The HIMSS (Healthcare Information and Management Systems Society) defines interoperability as:

The ability of different information technology systems and software applications to communicate, exchange data, and use the information that has been exchanged.

In an excellent article by CN Mead, he succinctly defines CSI as unambiguous data exchange.

From my perspective, there seem to be two problems leading to an insufficient degree of CSI.

The first problem is the use of the same name for different concepts. An example would be the name "ventricle," which can mean a heart ventricle or a brain ventricle. Another is the word "drug," which can mean a good thing in one context (e.g. a medication) or a bad thing in another (e.g. an illicit drug). In a previous post, I discussed how the CDISC SDTM Reference Start Date (RFSTDTC) can mean different things in different studies because the definition is not sufficiently precise.

The second problem is the use of multiple names for the same concept. An example would be stroke, cerebrovascular accident, and cerebral infarction. Controlled terminologies exist to address both of these problems.

The reality is we are never going to get away from these two problems. The world is much too large and too diverse to impose the same data standards across a domain as complex as health care. Yet, the "disambiguation" of concepts and their names (or "labels") must occur before meaningful pooling and analysis of data across multiple sources can happen (i.e. before the data are truly computable). One always wants to compare "apples to apples" in any analysis, especially those with public health implications. The current state is that this disambiguation occurs manually, resulting in (no surprise) slowness, inefficiencies, and inconsistencies or errors.

How can computers help? If data were expressed using the Resource Description Framework (RDF), some of the disambiguation could be automated. Here's how, starting with problem #1. All resources (e.g. concepts) have a Uniform Resource Identifier (URI), similar to a URL on the web. In the same way that the Google home page cannot be confused with the Yahoo! home page (because they have different URLs), systems can distinguish between a ventricle in the heart and a ventricle in the brain because they each have a different URI. Two resources with the same name/label are treated as different if their URIs are different.
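As a small illustration of the point about URIs, the sketch below uses two made-up URIs in the example.org namespace (placeholders, not terms from any real ontology) that carry the same label. Identity is decided by comparing URIs, never labels.

```python
# Hypothetical URIs; the example.org namespaces are placeholders only.
heart = "http://example.org/heart#Ventricle"
brain = "http://example.org/brain#Ventricle"

# Both resources carry the same human-readable label.
labels = {
    heart: "ventricle",
    brain: "ventricle",
}

def same_resource(uri_a, uri_b):
    """Identity is decided by URI, never by label."""
    return uri_a == uri_b

def resources_with_label(labels, label):
    """List every URI that carries the given label."""
    return sorted(uri for uri, text in labels.items() if text == label)

# One label, two distinct resources.
ambiguous = resources_with_label(labels, "ventricle")
print(same_resource(heart, brain), len(ambiguous))
```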

A little bit harder is how to address problem #2. Here we get into a little bit of philosophy on what it means for two things to be the same. This discussion gets very complicated very quickly. Most will agree that Mary Alice Smith and M.A. Smith are the same individual if they have the same parents and were born on the exact same date and time in the same hospital (assuming here that twins are typically born a few minutes apart). But is Mary Alice Smith, the 10 year old, the same as Mary Alice Smith, the 30 year old? People often say, "I was not the same person when I was 21 as I am now."

From a practical, I would say computational, perspective, two things might be considered the same if they share the same properties. In RDF, properties are described using predicates and objects. So M.A. Smith may have a predicate called ":birthdate" with object (value) of 1980-01-26T12:15:33. In the semantic web, one can search and identify all predicates and objects for a resource and compare them to the predicates and objects of another resource and see if they match.
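The comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the resource names, the :birthplace predicate, the :HospitalA value, and the choice of which predicates count as "key" are all hypothetical, and the triples are simple tuples rather than data from a real RDF store.

```python
def values(triples, subject, predicate):
    """Return the set of objects asserted for one subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def likely_same(triples, a, b, key_predicates):
    """True when both resources have values for every key predicate and they agree."""
    for predicate in key_predicates:
        va = values(triples, a, predicate)
        vb = values(triples, b, predicate)
        if not va or va != vb:
            return False
    return True

# Hypothetical data: two records that agree on the key properties, one that does not.
triples = [
    (":MaryAliceSmith", ":birthdate", "1980-01-26T12:15:33"),
    (":MaryAliceSmith", ":birthplace", ":HospitalA"),
    (":MASmith", ":birthdate", "1980-01-26T12:15:33"),
    (":MASmith", ":birthplace", ":HospitalA"),
    (":OtherSmith", ":birthdate", "1980-01-26T12:21:05"),
    (":OtherSmith", ":birthplace", ":HospitalA"),
]
key_predicates = [":birthdate", ":birthplace"]

same_1 = likely_same(triples, ":MaryAliceSmith", ":MASmith", key_predicates)
same_2 = likely_same(triples, ":MaryAliceSmith", ":OtherSmith", key_predicates)
print(same_1, same_2)
```

Which predicates belong in the key list is a judgment call, and it can change as new properties are documented.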

When you think about it, this is what we do manually. We inspect the two "things" and compare their properties. If enough of the important properties match, then for all practical purposes they are the same. If properties don't match, it raises the question "are they different?" With time, additional, more important properties might be identified that force us to change our conclusion and consider two similar things as different. Determining "sameness" is not an exact science, as it turns out. Consider breast cancer. There was a time when certain types of breast cancers were considered the same disease and treated the same way. Along came more sophisticated testing techniques, starting with estrogen receptor testing, and lo and behold, those tumors that were estrogen receptor positive behaved differently when exposed to the estrogen blocker tamoxifen. What was previously considered the same is now different. This happens all the time in medicine as new medical knowledge emerges. We used to think all cholesterol was the same, until LDL-cholesterol and HDL-cholesterol were discovered. It now appears that Parkinson's Disease is likely a collection of different conditions, each behaving differently and having different treatment responses.

The bottom line is that expressing the properties of resources (e.g. things) in RDF allows information systems to identify and compare the properties of two resources and make a reasonable (but not perfect) determination of sameness. This assumes that we never have complete information about anything. But what was previously a manual process can now be automated. As we learn more about a resource, more properties emerge and can be documented in RDF, and determinations of sameness become more accurate.

This is huge from an interoperability standpoint.


Intro to Semantic Web Technology for Clinical Research Data

A while back I put together a slide deck to introduce RDF and OWL to those within the clinical research community who may not be familiar with these standards. I've recently updated these slides and I'm now making them available on this blog. I welcome any comments or suggestions. Thank you.


Automating the Detection of Adverse Events

Adverse event (AE) reporting in the U.S. is largely voluntary, yet it forms the foundation for the detection of post-marketing safety signals that were not identified in clinical trials. It is generally recognized that only a small percentage of AEs get reported. This is certainly my experience in clinical practice. There simply wasn't time to report them all. I reported only those that were the most serious and clearly not described in labeling.

The deployment of Electronic Health Record (EHR) systems nationwide provides a tremendous opportunity to increase reporting. One area that interests me is the potential for EHRs to automatically detect an AE. This requires an unambiguous and computable definition of an AE. I'm not suggesting that EHRs replace a clinician's role in the process, but the potential to automate many steps that the clinician now performs manually is clearly in the best interest of public health.

In a recent post, I discussed how adverse events (AEs) are defined and modeled in BRIDG. I've been doing more reading on this topic and continue to have discussions with others on this important concept. The existing BRIDG definition closely reflects the definition provided in U.S. federal regulations (see 21 CFR 312.32(a)), which states the following:

Adverse event means any untoward medical occurrence associated with the use of a drug in humans, whether or not considered drug related.

The BRIDG definition is:

Any unfavorable and unintended sign, symptom, disease, or other medical occurrence with a temporal association with the use of a medical product, procedure or other therapy, or in conjunction with a research study, regardless of causal relationship. 

The BRIDG definition is appropriately broader. Both contain the concept "medical occurrence." But what is a medical occurrence? How can an EHR detect "medical occurrences?" We're not quite there yet in establishing a computable definition for an AE. But I think we are close.

The Free Medical Dictionary defines occurrence as "any event or incident." The BRIDG definition includes "unintended sign, symptom," which are clinical observations and I think can be considered incidents. Can a clinical observation be an adverse event? I think the answer is No. The observation needs to undergo an assessment by a qualified individual, such as a health care provider, to establish that:

  1. The observation is indicative of the presence of a medical condition
  2. The onset (or worsening) of the medical condition occurs after a medical intervention (e.g. drug administration)

Only when these two criteria are met can one identify an adverse event. What do I mean by a medical condition? The Free Medical Dictionary provides an excellent definition:

medical condition

A disease, illness or injury; any physiologic, mental or psychological condition or disorder (e.g., orthopaedic; visual, speech or hearing impairments; cerebral palsy; epilepsy; muscular dystrophy; multiple sclerosis; cancer; coronary artery disease; diabetes; mental retardation; emotional or mental illness; specific learning disabilities; HIV disease; TB; drug addiction; alcoholism). A biological or psychological state which is within the range of normal human variation is not a medical condition.

Medical condition is a phrase used in documents for physicians applying to licensing agencies (e.g., state medical boards, malpractice insurance carriers, third-party payers, etc.), which is used to determine a physician’s physical “suitability” to practise medicine.

What is a medical intervention? I mean any activity (e.g. drug administration, surgery, radiation, device implantation, etc.) undertaken to treat, prevent, cure, mitigate, diagnose, or induce a medical condition.

It is clear that a temporal association with a medical intervention is necessary to establish an AE, so I expect general agreement with the second criterion. Note that a causal relationship is not necessary. The first criterion is where there may be disagreement. Here are some examples where just relying on an observation to establish an AE is problematic.

A patient is started on Drug X for a valid indication. The patient has no prior history of hypertension. He also happens to be morbidly obese. A week later, the nurse measures his BP at 155/100 mmHg. Is this an adverse event related to Drug X? I would argue it is not, because an assessment hasn't been done to establish that the patient does indeed have hypertension. In this example, the nurse used a normal sized BP cuff, which is well known to give falsely elevated BP readings in morbidly obese patients. When the BP was repeated using a large BP cuff, the readings were repeatedly within the normal range.

Let's now say that the patient also had a serum chemistry panel and the serum potassium came back elevated at 5.5 mEq/L (normal for the lab is 3.5-5.0 mEq/L). Is this an AE? Again, for the same reason, an assessment is necessary to establish the presence of an underlying medical condition, in this case hyperkalemia. In this example, examination of the biospecimen sent to the lab indicated the presence of hemolysis. It is well known that hemolysis can spuriously raise a serum potassium measurement due to the high concentrations of intracellular potassium in erythrocytes. The chemistry panel was repeated, making sure hemolysis was not present in the biospecimen, and the serum potassium was in the normal range.

The following week the patient was involved in a motor vehicle accident (MVA). Is the accident an adverse event? Again, an assessment is needed. Additional observations might reveal that the patient was sleepy at the time. Hypersomnolence, a new onset medical condition, would be a valid AE that may have precipitated the MVA, but the MVA itself is not the adverse event. 

So my computable definition of an adverse event is:

a new onset medical condition that begins after a medical intervention OR a pre-existing medical condition that worsens after a medical intervention

This definition allows a program in an EHR system to identify adverse events automatically if the medical conditions and interventions are appropriately coded. Clearly there is a time interval component that must also be defined. It would be silly to report a new medical condition that happened years after taking a short-acting drug. One can establish guidelines for a reasonable time interval between the intervention and the medical condition. These should take into consideration things like the pharmacokinetic and pharmacodynamic properties of a drug intervention. Longer time intervals between the intervention and the medical condition will flag more conditions that are unlikely to be related to the intervention, while shorter intervals risk missing conditions that are, so the time interval must be chosen wisely.
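Here is a minimal sketch of how such a program might apply the definition. It is an illustration, not a validated algorithm. The record types, the codes ("drug-x", "hypersomnolence", and so on), the dates, and the 30-day window are all assumptions made up for this example. A real EHR would use coded terminologies and an interval chosen per intervention.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class Intervention:
    code: str
    start: date

@dataclass(frozen=True)
class ConditionEvent:
    code: str
    when: date
    kind: str  # "onset" (new condition) or "worsening" (pre-existing condition)

def candidate_adverse_events(interventions, events, window):
    """Pair each condition event with interventions that precede it by at most `window`."""
    hits = []
    for ev in events:
        for iv in interventions:
            gap = ev.when - iv.start
            # The event must fall strictly after the intervention and inside the window.
            if timedelta(0) < gap <= window:
                hits.append((ev.code, ev.kind, iv.code, gap.days))
    return hits

# Hypothetical patient record.
interventions = [Intervention("drug-x", date(2020, 1, 1))]
events = [
    ConditionEvent("diabetes", date(2019, 6, 1), "onset"),         # before the intervention
    ConditionEvent("hypersomnolence", date(2020, 1, 10), "onset"),  # new onset, 9 days after
    ConditionEvent("diabetes", date(2020, 1, 20), "worsening"),     # worsening, 19 days after
    ConditionEvent("gout", date(2023, 3, 1), "onset"),              # far outside the window
]

hits = candidate_adverse_events(interventions, events, window=timedelta(days=30))
for hit in hits:
    print(hit)
```

The output is a list of candidates for a clinician to review, which keeps the clinician's assessment in the loop.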

From a modeling perspective, I think an Adverse Event is a subclass of a Medical Condition. Medical conditions are the results of assessments of observations, and the definition of an adverse event should be modified accordingly to reflect these relationships. I think this will help pave the way for automatic detection of AEs within EHRs.