2015-11-22

Determining Study Eligibility

There is great interest in automating the process of determining eligibility for a study. I am aware of two major use cases. The first is to assess protocol feasibility and facilitate recruitment into trials. In this case, one would like to query multiple EHR sources to identify potentially eligible subjects. The second use case arises during study conduct. During the screening phase, data are collected and used to perform an eligibility analysis. The eligibility rules form the analysis plan for that analysis, and its result is a binary outcome: true (eligible) or false (not eligible). Those who pass continue on to experimental treatment. Automating this analysis would facilitate enrollment and decrease the risk of both false-positive and false-negative enrollment errors.

Modeling eligibility criteria using RDF is the subject of a project within the PhUSE Semantic Technologies Working Group. I encourage interested individuals to participate in the group's regular calls to help refine and test our approach. We could use more help. In a previous post, I discussed how the owl:Restriction class can be used to determine eligibility, but it turns out this supports only the simplest criteria. OWL 2 introduces property chains, and with them we were able to model a few more criteria, but not enough. Here I describe a much more powerful strategy: using the SPARQL Inferencing Notation (SPIN). I credit my colleague Ron Katriel for suggesting we use SPIN. I am now busily soaking up everything I can learn about SPIN. It is quite powerful.

The diagram below illustrates the classic subClass relationship between a Subject and an EligibleSubject. The challenge before the working group is: using RDF, how does one automate determining which members of the superclass Subject belong in the subClass EligibleSubject?


Without getting into the details of SPIN, it is possible to write eligibility rules using SPARQL, the query language of the semantic web, and embed them into RDF triples so that the computable eligibility rules exist with the data and can be used to identify eligible subjects.

In the PhUSE working group, we are working with an imaginary study, ABCD. It has the following eligibility criteria:
Subject must be female.
Subject must have Diabetes Mellitus for at least 2 years.
Subject must have a positive Hepatitis B Surface Antigen (HBsAg) test.
Subject must have a negative RPR test for syphilis.
Subject must have a systolic blood pressure of less than or equal to 140 mmHg.

First we created a simple study ontology to relate subjects with the study and its activities (in this case subject observations). Then we populated it with dummy data for 5 subjects. Then we wrote SPARQL queries for each criterion. This allows us to define a subClass of subjects that meet a criterion ("EligibilityCriterionSubject").
The query to define the subClass of Subjects that have systolic BP <= 140 mmHg looks like this:

CONSTRUCT {
    ?this a ec2:SBPLTE140Subject .
}
WHERE {
    ?this a ec2:Subject_ABCD .
    ?q a ec2:SystolicBP .
    ?r a ec2:SystolicBPOutcome .
    ?this ec2:undergoes ?q .
    ?q ec2:hasOutcome ?r .
    ?r ec2:hasValue ?v .

    FILTER (?v <= 140) .
}

The query to define the subClass of Subjects with Diabetes Mellitus for at least 2 years looks like this:

CONSTRUCT {
    ?this a ec2:DiabetesMellitus2YRSubject .
}
WHERE {
    ?this a ec2:Subject_ABCD .
    ?q a ec2:DiabetesMellitus .
    ?q ec2:medConditionInterval ?r .
    ?r time:hasBeginning ?s .
    ?s time:inXSDDateTime ?dmstart .
    ?this ec2:hasPersonMedicalCondition ?q .
    ?this ec2:studyInterval ?t .
    ?t time:hasBeginning ?u .
    ?u time:inXSDDateTime ?sstart .
    BIND (smf:duration("y", ?dmstart, ?sstart) AS ?dmdur) .
    FILTER (xsd:integer(?dmdur) >= 2) .

}

Each query is embedded in an RDF triple of the form:

:Subject_ABCD spin:rule "query" . 

The class of EligibleSubjects for Study ABCD is defined, using owl:intersectionOf, as the intersection of the individual EligibilityCriterionSubject classes, one per criterion.
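
As an illustration only, the same intersection can also be written as one more SPARQL rule, which may be easier to read than the OWL definition. This is a sketch, not what the working group uses: the ec2: namespace IRI is a placeholder, and the class names ec2:EligibleSubject_ABCD, ec2:FemaleSubject, ec2:HBsAgPositiveSubject and ec2:RPRNegativeSubject are hypothetical names made up for this example. Only ec2:SBPLTE140Subject and ec2:DiabetesMellitus2YRSubject come from the queries above.

```sparql
# Placeholder namespace; substitute the IRI used by the study ontology.
PREFIX ec2: <http://example.org/ec2#>

CONSTRUCT {
    # Hypothetical name for the class of eligible subjects in Study ABCD.
    ?this a ec2:EligibleSubject_ABCD .
}
WHERE {
    ?this a ec2:Subject_ABCD .
    # Criterion classes produced by the two rules shown earlier.
    ?this a ec2:SBPLTE140Subject .
    ?this a ec2:DiabetesMellitus2YRSubject .
    # Hypothetical criterion classes for the remaining three criteria.
    ?this a ec2:FemaleSubject .
    ?this a ec2:HBsAgPositiveSubject .
    ?this a ec2:RPRNegativeSubject .
}
```

A subject matches only if it carries all five criterion types, so this rule depends on the per-criterion triples already being present when it is evaluated.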

When you run inferences using a standard semantic web tool (we are using TopBraid Composer), the system correctly identifies 3 subjects with a positive HepB antigen test, 3 subjects with Diabetes Mellitus (two of whom have had the diagnosis for at least 2 years), and so forth. The intersection of all these classes contains only one subject (Subject0001), who is a member of every individual EligibilityCriterionSubject subClass and is therefore a member of the EligibleSubject class for that study.

Because the rules are expressed in SPARQL, performing a distributed query across multiple data sources is straightforward, at least in theory. Of course, the data being queried should be in RDF, although I understand there is a way to convert SPARQL to SQL to query local databases.
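
To make the distributed case concrete, here is a minimal sketch using the SERVICE keyword from SPARQL 1.1 Federated Query. It reuses the systolic BP pattern from above and sends it to two remote endpoints. The endpoint URLs and the ec2: namespace IRI are placeholders, and the sketch assumes each source exposes its data in RDF using the same study ontology, which will rarely be true of real EHR systems without a mapping step.

```sparql
# Placeholder namespace; substitute the IRI used by the study ontology.
PREFIX ec2: <http://example.org/ec2#>

SELECT DISTINCT ?subject
WHERE {
    {
        # Hypothetical endpoint for the first data source.
        SERVICE <http://example.org/siteA/sparql> {
            ?subject a ec2:Subject_ABCD ;
                     ec2:undergoes ?q .
            ?q a ec2:SystolicBP ;
               ec2:hasOutcome ?r .
            ?r a ec2:SystolicBPOutcome ;
               ec2:hasValue ?v .
            FILTER (?v <= 140)
        }
    }
    UNION
    {
        # Hypothetical endpoint for the second data source.
        SERVICE <http://example.org/siteB/sparql> {
            ?subject a ec2:Subject_ABCD ;
                     ec2:undergoes ?q .
            ?q a ec2:SystolicBP ;
               ec2:hasOutcome ?r .
            ?r a ec2:SystolicBPOutcome ;
               ec2:hasValue ?v .
            FILTER (?v <= 140)
        }
    }
}
```

Each SERVICE block is evaluated by the remote endpoint, and the UNION combines the matching subjects from both sources into one result.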

We're still working on the ontology for this imaginary study, and plan to add more complex eligibility criteria in the future. For example, we want to add timing variables to support rules such as: Subject must have a Systolic BP <= 140 mmHg measured during the two-week screening period.

Please email me or leave a comment below if you wish to see the most current working copy of the study ontology/knowledgebase. It is evolving quickly.