2015-09-30

Restriction Classes

As I delve deeper into the world of RDF and OWL, I keep coming across a single OWL class that turns out to be critical in enabling computer-assisted reasoning. This class is called owl:Restriction. Here's my understanding of what this class is and why it's so important for computer-assisted reasoning across biomedical data. We need to understand it and use it more in the data we collect and analyze.

In the semantic web, everything is grouped into classes or "groups of things" that share common properties. A member of a class is called an individual (or an instance). Classes can have subclasses (or subsets) that share additional properties with each other, but not necessarily with other members of the superclass.  So the Apple class is a subclass of the Fruit class since every Apple is a Fruit (but not vice versa).

So let's define two classes called Automobile and Corolla. Let's also say Corolla is a subclass of Automobile. In RDF this would look like this:

:Automobile rdf:type owl:Class.
:Corolla rdf:type owl:Class.
:Corolla rdfs:subClassOf :Automobile. 

I use Turtle syntax throughout because it's easy for humans to read and understand.

So it turns out my friend Ruth has a Corolla that she calls "Tristan" (she's a big opera lover and names all her cars after Wagnerian characters). So in RDF this would be expressed as:

ruth:Tristan rdf:type :Corolla.

Which simply says there is a thing/resource called Tristan and it's a type of :Corolla; i.e. a member of the :Corolla class. (I'm intentionally avoiding the namespace issue.)

A reasoner can analyze these triples and conclude that Tristan is an Automobile. The following triple (new knowledge) can be inferred:

ruth:Tristan rdf:type :Automobile. 

Now this is rather trivial new knowledge, but bear with me.
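
To make the inference step concrete, here is a minimal sketch in Python of what a reasoner does with rdfs:subClassOf. It is not how a real OWL reasoner is implemented; the function name infer_types and the representation of triples as plain tuples are my own assumptions for illustration.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def infer_types(triples):
    """Return the triples plus every rdf:type triple implied by rdfs:subClassOf."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triple appears
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for child, q, parent in list(inferred):
                if q == SUBCLASS_OF and child == o:
                    new = (s, RDF_TYPE, parent)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

# The triples asserted in this post, written as (subject, predicate, object).
asserted = {
    (":Automobile", RDF_TYPE, "owl:Class"),
    (":Corolla", RDF_TYPE, "owl:Class"),
    (":Corolla", SUBCLASS_OF, ":Automobile"),
    ("ruth:Tristan", RDF_TYPE, ":Corolla"),
}

new_knowledge = infer_types(asserted) - asserted
print(new_knowledge)
```

The only new triple is the one stating that ruth:Tristan is an :Automobile. Nothing flows in the other direction: a resource asserted only as an :Automobile never becomes a :Corolla.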

The limitation of this example is that if another resource comes along, let's call it Isolde, and someone out there has asserted that Isolde is an Automobile, the computer has no way of reasoning what type of automobile Isolde might be. The computer doesn't understand what property (or properties) of an automobile make it a Corolla, or any other type of automobile.

How does this apply to clinical trials? Replace "Automobile" with "Subject" and "Corolla" with "EligibleSubject" in my example. From the triples I asserted, I know that Tristan is an EligibleSubject in the trial simply because I said so (in reality, I manually analyzed his screening data and confirmed him to be eligible). But the computer doesn't know why. It just knows that he's an EligibleSubject because someone said so. When the next Subject, Isolde, comes along, the computer is clueless. Is she eligible too? You'd like the computer to be able to figure it out, right?

So how does owl:Restriction help? owl:Restriction allows us to define new classes based on the properties that individuals in that class share. So let's assume that the study in question is an oral contraceptive study in females. To be considered eligible the Subject must be female. So in RDF we define a property called :sex and we assert:

:Isolde rdf:type :Subject.
:Isolde :sex :Female. 

These triples say Isolde is a Subject and she's female.

But how does the computer know she's eligible for the study? We create a new (nameless) class and "restrict" membership in that class only to females. Then we say only members of that class are EligibleSubjects. Before we see how it looks in RDF, we need to review blank nodes.

A "nameless" resource is allowed in RDF. It's called a blank node or bnode. It has no URI of its own. In Turtle it is written with square brackets that enclose the properties of the blank node. For example:

[ :sex :Female; 
     :wrote :WutheringHeights]

In plain English, one would say, and a computer would interpret:  "this is a nameless resource who is a female who also wrote Wuthering Heights."

Now let's create a blank node that looks like this:

[a owl:Restriction;
     owl:onProperty :sex;
     owl:hasValue :Female]

In plain English: this is a nameless class whose members all have the value :Female for the property :sex. It basically defines the class of resources that are female.

So now one can assert the following triple in the protocol:

:EligibleSubject owl:equivalentClass [a owl:Restriction;
                                           owl:onProperty :sex;
                                           owl:hasValue :Female].                                                                            

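Because owl:equivalentClass works in both directions, this says that every eligible subject is female and that any resource whose :sex is :Female is an eligible subject.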

So now when :Isolde comes along and the computer sees the following triple saying she's a female:

:Isolde :sex :Female. 

the computer can infer the following new triple:

:Isolde rdf:type :EligibleSubject. 

We've successfully defined the class EligibleSubject based on a property its members all share, and now a computer can identify new members of that class. If the computer had access to many individuals and their properties on the web or in EHR systems, this approach could be used for recruitment. This is a hot topic in clinical trials at the moment.
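
Here is a rough Python sketch of the inference a reasoner draws from the owl:hasValue restriction above. It only mimics the idea. The dictionary RESTRICTIONS stands in for the owl:equivalentClass triple, and the function name classify and the second subject :Siegfried (with :sex :Male) are hypothetical additions for illustration.

```python
RDF_TYPE = "rdf:type"

# Each entry stands for:
#   <class> owl:equivalentClass
#       [a owl:Restriction; owl:onProperty <property>; owl:hasValue <value>].
RESTRICTIONS = {
    ":EligibleSubject": (":sex", ":Female"),
}

def classify(triples, restrictions):
    """Return the new rdf:type triples implied by the hasValue restrictions."""
    inferred = set()
    for cls, (prop, value) in restrictions.items():
        for s, p, o in triples:
            if p == prop and o == value:
                inferred.add((s, RDF_TYPE, cls))
    return inferred - set(triples)  # keep only triples not already asserted

data = {
    (":Isolde", RDF_TYPE, ":Subject"),
    (":Isolde", ":sex", ":Female"),
    (":Siegfried", RDF_TYPE, ":Subject"),  # hypothetical second subject
    (":Siegfried", ":sex", ":Male"),
}

eligible = classify(data, RESTRICTIONS)
print(eligible)
```

Only :Isolde ends up typed as :EligibleSubject, and nobody had to assert it by hand.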

This same strategy can be used in many settings. Consider these triples:

:Drug rdf:type owl:Class.
:EffectiveDrug rdf:type owl:Class.
:EffectiveDrug rdfs:subClassOf :Drug. 

We're asserting that Drug is a class and that EffectiveDrug is a class, which is a subclass of Drug.

By using owl:Restriction and the associated properties owl:onProperty and owl:hasValue, one has the ability to tell a computer what properties of a drug make it an effective drug. This way computers can help identify effective drugs for us. It's a paradigm shift in how we do efficacy evaluations right now, but owl:Restriction makes it possible.

The possibilities are endless. One can imagine many classes such as PureDrug, ExpiredDrug, InvestigationalDrug, MarketedDrug, and basically define them as owl:Restriction classes, allowing computers to automatically determine which class(es) any given drug belongs to.
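
As a sketch of how one drug could fall into several restriction classes at once, here is a small Python example. Everything specific in it is an assumption made up for illustration: the properties :regulatoryStatus and :pastExpiryDate, their values, and the drugs :DrugA and :DrugB do not come from any real vocabulary.

```python
from collections import defaultdict

# Hypothetical restriction classes: class -> (property, required value).
DRUG_CLASSES = {
    ":MarketedDrug": (":regulatoryStatus", ":Approved"),
    ":InvestigationalDrug": (":regulatoryStatus", ":Investigational"),
    ":ExpiredDrug": (":pastExpiryDate", "true"),
}

def classes_for(triples, restrictions):
    """Map each subject to the set of restriction classes it satisfies."""
    membership = defaultdict(set)
    for s, p, o in triples:
        for cls, (prop, value) in restrictions.items():
            if (p, o) == (prop, value):
                membership[s].add(cls)
    return dict(membership)

# Hypothetical drug data.
drugs = [
    (":DrugA", ":regulatoryStatus", ":Approved"),
    (":DrugA", ":pastExpiryDate", "true"),
    (":DrugB", ":regulatoryStatus", ":Investigational"),
]

result = classes_for(drugs, DRUG_CLASSES)
print(result)
```

Here :DrugA is both a :MarketedDrug and an :ExpiredDrug, while :DrugB is only an :InvestigationalDrug.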

In a future post, I'll discuss how owl:Restriction can be used to manage study workflow; specifically how it can manage the sequence of study activities as described in the protocol, including support for branching and adaptive designs.







2015-09-29

The Interoperability Problem

I've been reading quite a bit about interoperability, or specifically computable semantic interoperability (CSI), and why it's so difficult to achieve in health care. HIMSS (the Healthcare Information and Management Systems Society) defines interoperability as:

The ability of different information technology systems and software applications to communicate, exchange data, and use the information that has been exchanged.

In an excellent article, CN Mead succinctly defines CSI as unambiguous data exchange.

From my perspective, there seem to be two problems leading to an insufficient degree of CSI.

The first problem is the use of the same name for different concepts. An example would be the name "ventricle," which can mean a heart ventricle or a brain ventricle. Another is the word "drug," which in one context means a good thing (a medication) and in another a bad thing (an illicit drug). In a previous post, I discussed how the CDISC SDTM Reference Start Date (RFSTDTC) can mean different things in different studies because the definition is not sufficiently precise.

The second problem is the use of multiple names for the same concept. An example would be stroke, cerebrovascular accident, and cerebral infarction. Controlled terminologies exist to address these problems.

The reality is we are never going to get away from these two problems. The world is much too large and too diverse to impose the same data standards across a domain as complex as health care. Yet, the "disambiguation" of concepts and their names (or "labels") must occur before meaningful pooling and analysis of data across multiple sources can happen (i.e. before the data are truly computable). One always wants to compare "apples to apples" in any analysis, especially those with public health implications. The current state is that this disambiguation occurs manually, resulting in (no surprise) slowness, inefficiencies, and inconsistencies or errors.

How can computers help? If data were expressed using the Resource Description Framework (RDF), some of the disambiguation could be automated. Here's how. All resources (e.g. concepts) have a Uniform Resource Identifier (URI), similar to a URL on the web. In the same way that the Google home page cannot be confused with the Yahoo! home page (because they have different URLs), systems can distinguish between a ventricle in the heart and a ventricle in the brain because they each have a different URI. Two resources with the same name/label are treated as different if their URIs are different.
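
A small Python sketch of the point about URIs: two resources can share a label and still be kept apart. The example.org URIs are placeholders I made up; only the label "ventricle" comes from the discussion above.

```python
LABEL = "rdfs:label"

# Hypothetical URIs for the two concepts that share the label "ventricle".
triples = [
    ("http://example.org/anatomy/HeartVentricle", LABEL, "ventricle"),
    ("http://example.org/anatomy/BrainVentricle", LABEL, "ventricle"),
]

def resources_with_label(triples, label):
    """Return the set of URIs that carry the given label."""
    return {s for s, p, o in triples if p == LABEL and o == label}

matches = resources_with_label(triples, "ventricle")
print(len(matches))  # two distinct resources share one label
```

A search by label returns both resources. A system that works with the URIs never mixes them up.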

A little bit harder is how to address problem #2. Here we get into a little bit of philosophy on what it means for two things to be the same. This discussion gets very complicated very quickly. Most will agree that Mary Alice Smith and M.A. Smith are the same individual if they have the same parents and were born on the exact same date and time in the same hospital (assuming here that twins are typically born a few minutes apart). But is Mary Alice Smith, the 10-year-old, the same as Mary Alice Smith, the 30-year-old? People often say, "I am not the same person now that I was at 21."

From a practical, I would say computational, perspective, two things might be considered the same if they share the same properties. In RDF, properties are described using predicates and objects. So M.A. Smith may have a predicate called ":birthdate" with object (value) of 1980-01-26T12:15:33. In the semantic web, one can search and identify all predicates and objects for a resource and compare them to the predicates and objects of another resource and see if they match.
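
The comparison just described can be sketched in a few lines of Python. This is only an illustration, assuming each resource has at most one value per predicate. The identifiers ex:p1 and ex:p2, the :birthplace property, and ex:HospitalX are hypothetical; the :birthdate value is the one used above.

```python
def properties(triples, subject):
    """Collect the (predicate, object) pairs asserted for one subject."""
    return {(p, o) for s, p, o in triples if s == subject}

def compare(triples, a, b):
    """Return the shared pairs and the predicates on which a and b disagree."""
    pa, pb = properties(triples, a), properties(triples, b)
    shared = pa & pb
    da, db = dict(pa), dict(pb)  # assumes one value per predicate
    conflicts = {p for p in da.keys() & db.keys() if da[p] != db[p]}
    return shared, conflicts

# Hypothetical records for the two names discussed above.
people = [
    ("ex:p1", "rdfs:label", "Mary Alice Smith"),
    ("ex:p1", ":birthdate", "1980-01-26T12:15:33"),
    ("ex:p1", ":birthplace", "ex:HospitalX"),
    ("ex:p2", "rdfs:label", "M.A. Smith"),
    ("ex:p2", ":birthdate", "1980-01-26T12:15:33"),
    ("ex:p2", ":birthplace", "ex:HospitalX"),
]

shared, conflicts = compare(people, "ex:p1", "ex:p2")
print(shared)
print(conflicts)
```

The two records agree on birth date and birthplace and differ only in their labels. Whether that is enough to call them the same person is a judgment about which properties matter.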

When you think about it, this is what we do manually. We inspect the two "things" and compare their properties. If enough of the important properties match, then for all practical purposes they are the same. If properties don't match, it raises the question "are they different?" With time, additional, more important properties might be identified that force us to change our conclusion and consider two similar things as different. Determining "sameness" is not an exact science, as it turns out. Consider breast cancer. There was a time when certain types of breast cancers were considered the same disease and treated the same way. Along came more sophisticated testing techniques, starting with estrogen receptor testing, and lo and behold, those tumors that were estrogen receptor positive behaved differently when exposed to the estrogen blocker tamoxifen. What was previously considered the same is now different. This happens all the time in medicine as new medical knowledge emerges. We used to think all cholesterol was the same, until LDL-cholesterol and HDL-cholesterol were discovered. It now appears that Parkinson's Disease is likely a collection of different conditions, each behaving differently and having different treatment responses.

The bottom line is expressing the properties of resources (e.g. things) in RDF allows information systems to identify and compare the properties of two resources and make a reasonable (but not perfect) determination of sameness. This assumes that we never have complete information about anything. But, what was previously a manual process can now be automated. As we learn more about a resource, more properties emerge and can be documented in RDF, and determinations of sameness become more accurate.

This is huge from an interoperability standpoint.




2015-09-22

Intro to Semantic Web Technology for Clinical Research Data

A while back I put together a slide deck to introduce RDF and OWL to those within the clinical research community who may not be familiar with these standards. I've recently updated these slides and I'm now making them available on this blog. I welcome any comments or suggestions. Thank you.