2017-10-13

Managing Study Workflow using the Resource Description Framework (RDF)

Imagine a study conduct tool where:

  1. One enters the result of an observation and the tool immediately lets you know what observation(s) to collect next.
  2. The tool immediately performs a data quality check and creates an alert if there is a problem.
  3. One enters the results of all the screening tests and the tool immediately tells you whether the subject is eligible to continue in the trial.
  4. If the subject is not eligible, it tells you why he or she failed screening.
  5. In an adaptive trial design, it analyzes the observations in real time and informs you what protocol-specified modifications one can make in the subject's treatment plan.

These are all possible today if we start representing study data, including the protocol, using RDF.


In my last post, I spoke about a paper on this topic. I am now sharing that paper so those interested can take a look at it. I'm also posting the slides I presented at the annual PhUSE meeting in Edinburgh, Scotland (Oct 8-11, 2017), which was a very successful meeting by the way. A record number of attendees (695) participated in the conference.

I hope you will find it interesting. Please send me your comments. Thank you.

Paper: Managing Study Workflow using the Resource Description Framework

Slides: Managing Study Workflow


2017-08-28

Eligibility Criteria, Screen Failures, and another RDF Success Story

It's T minus 6 weeks (approximately) until the PhUSE 13th Annual Conference in Edinburgh, Scotland and I'm beaming with excitement. I'm involved in two study data projects using RDF and both are going very well. The first one, in collaboration with Tim Williams and an enthusiastic project team of PhUSE volunteers, is called Clinical Trials Data in RDF, which, among its various goals, will demonstrate how study data in RDF can be used to automatically generate highly standards-conforming, submission-quality SDTM datasets.

But it's the second paper that I want to discuss today. It's called "Managing Study Workflow Using the RDF." The paper is in pre-publication status so I can't share it today, but I plan to post a copy here after the conference. I include the Abstract below.

In a nutshell, the paper describes how one can represent study activity start rules using SPIN (SPARQL Inferencing Notation), a form of RDF, to identify which study activity(-ies) are to be performed next based on what activities have already been performed. Well, it turns out that the start rule for the Randomization activity in a typical double-blind trial is in fact the Eligibility Criteria for the study. Here it is, in an executable form that, when combined with a standard COTS (commercial off-the-shelf) RDF inferencing engine, can automatically determine eligibility. How cool is that?

A typical eligibility rule consists of multiple subrules, each of which must be TRUE for the overall rule to be TRUE (e.g. AGE must be >=18 years AND RPR must be negative AND Pregnancy Test must be negative, and so on); exclusion criteria can be negated and added as subrules. The ontology also describes how to skip subrules that can logically be skipped (e.g. the Pregnancy Test for a male subject). The end result is that identifying an Eligible Subject is automatic and performed simply by entering the screening test results in the knowledgebase. (Think of a knowledgebase as an RDF database.)

Without going into the details (wait for the paper!), the rule points to all the screening activities that matter, checks each one for the expected outcome/result, and returns TRUE or FALSE depending on whether the conditions of the rule are met. If the rule outcome is TRUE, the subject is eligible and the Randomization activity is enabled. If the rule is FALSE, then just the opposite. The paper describes data from eight hypothetical subjects who were screened for a hypothetical study with just a few screening tests/criteria. The ontology came up with the correct eligibility outcome for all eight.
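To make this concrete, here is a hedged sketch of what such an eligibility check could look like as a plain SPARQL ASK query (SPIN essentially wraps queries like this in RDF). All prefixes, class names, and property names here (:AgeCollection, :RPRTest, :hasOutcome, :hasCodedTerm, :hasSex) are illustrative assumptions, not the paper's actual ontology:

```sparql
PREFIX : <http://example.org/study#>

ASK {
  # ?subject would be bound to the candidate subject being screened

  # Subrule 1: AGE must be >= 18 years
  ?subject :participatesIn ?age .
  ?age     a :AgeCollection ;
           :hasOutcome/:hasValue ?years .
  FILTER (?years >= 18)

  # Subrule 2: RPR must be negative
  ?subject :participatesIn ?rpr .
  ?rpr     a :RPRTest ;
           :hasOutcome/:hasCodedTerm :result_NEGATIVE .

  # Subrule 3: Pregnancy Test must be negative,
  # logically skipped for male subjects
  { ?subject :hasSex :sex_MALE . }
  UNION
  { ?subject :participatesIn ?preg .
    ?preg    a :PregnancyTest ;
             :hasOutcome/:hasCodedTerm :result_NEGATIVE . }
}
```

The query answers TRUE only when every subrule is satisfied (or legitimately skipped), mirroring the AND-ed structure described above.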

But there is more... By adding a few more simple SPIN rules to the ontology, the inferencing engine can readily provide a list of all Screen Failures, and the tests that caused them to fail. It can also identify the tests that were logically skipped and therefore ignored for eligibility testing purposes. Do you want to determine which Screen Failure subjects received study medication? Another SPIN rule can do that too. The possibilities are quite exciting. It makes RDF, in my humble opinion, a strong candidate for representing clinical trial data during study conduct. No other standard that I know of supports this kind of automation "out of the box." In RDF, the model and the implementation of the model are the same! And, once one is ready to submit, you press another button and submission-quality SDTM datasets are generated (which the first project I mentioned intends to demonstrate).
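As an illustration, a query over the inferred results might look like the following sketch, assuming a SPIN rule has already classified subjects into a hypothetical :ScreenFailure class and linked each failed outcome to the criterion it violated (both names are made up for this example):

```sparql
PREFIX : <http://example.org/study#>

SELECT ?subject ?failedTest
WHERE {
  ?subject a :ScreenFailure ;               # membership inferred by a SPIN rule
           :participatesIn ?failedTest .
  ?failedTest :hasOutcome ?outcome .
  ?outcome :violatesCriterion ?criterion .  # link asserted by the eligibility rule
}
```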

For more details, contact me, or wait until after the PhUSE meeting in October for the full paper.

ABSTRACT
A clinical study is fundamentally a collection of activities that are performed according to protocol-specified rules. It is not unusual for a single subject to undergo hundreds of study-related activities. Managing this workflow is a formidable challenge. The investigator must ensure that all activities are conducted at the right time and in the correct sequence, sometimes skipping activities that logically need not be done. It is not surprising that errors occur.


This paper explores the use of the Resource Description Framework (RDF) and related standards to automate the management of a study workflow. It describes how protocol information can be expressed in RDF in a computable way, such that an information system can easily identify which activities have been performed, determine which activities should be performed next, and recognize which can be logically skipped. This approach has the potential to improve how studies are conducted, resulting in better compliance and better data.

2017-08-23

Quality Data in Clinical Trials, Part 2

It's been two years since I wrote about quality data in clinical trials. As I re-read that post now, I agree with most of what I said, but it's time to update my thinking based on experience gained since then with study data validation processes.

I made the point that there are two types of validation rules: conformance rules (to data standards) and business rules, a.k.a. data quality checks. I had suggested that conformance rules are best managed by the standards development organization, but the fact is that sponsors and FDA support multiple standards (SDTM, MedDRA, CDISC Terminology, WHO Drug Dictionary), so it's up to FDA to manage the collective set of conformance rules across the broad data standards landscape.

The division between conformance rules and business rules is still quite important. They serve different functions. Ensuring conformance to standards enables automation. Ensuring data quality enables meeting the study objectives. One can assess data quality on legacy data, but it is a slow, manual process. Standardized data enable automated data quality checks that can more easily uncover serious data content issues that can impede analysis and review.

As a former FDA reviewer, and a big proponent of data standards, I can honestly say that FDA reviewers care very little about data standards issues. All they care about is that the data be sufficiently standardized so they can run standard automated analyses. The analyses drive the level of standardization needed by the organization. These analyses include the automated data quality checks. One cannot determine if AGE < 0 (a data quality check) if AGE is called something else or is located in the wrong domain (conformance issues).
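As a sketch of that dependency: if the demographic data were represented in RDF (as discussed elsewhere on this blog), the AGE check becomes a one-line filter, but only because the data use the agreed-upon property (here an illustrative :hasAge, standing in for the standardized AGE variable):

```sparql
PREFIX : <http://example.org/study#>

SELECT ?subject ?age
WHERE {
  ?subject :hasAge ?age .
  FILTER (?age < 0)    # flag impossible ages
}
```

If a sponsor stored age under some other name, the query would simply return nothing: the quality check never fires, which is exactly why conformance must come first.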

It's like driving a car. You want to get from point A to point B quickly (data quality); you don't really care what's under the hood (standards conformance). That is for mechanics (or data analysts) to worry about.

FDA now has a robust service to assess study "Data Fitness" (described as determining whether data are fit for use). Data Fitness combines both conformance and business rules. They are not split, and the reviewer is left to deal with the data conformance issues, which they care little about as there is generally a manual work-around, along with the data quality issues, which are of most importance to them and have the biggest impact on the review. Combining the two is a mistake. I believe Data Fitness as a concept should be retired and the service split into two: Standards Conformance and Data Quality. The Data Quality assessment service should only be performed on data that have passed the minimum level of conformance needed by the organization. If a study fails conformance testing, it wasn't standardized properly and those errors need to be corrected. In the new era requiring the use of data standards, FDA reviewers should not be burdened with data that do not pass a minimum level of conformance validation.

Consider this hypothetical scenario as an example to drive home my point. FDA requires sponsors to submit a study report supporting the safety and effectiveness of a drug. The report should be submitted digitally using the PDF standard. The report arrives and the file cannot be opened using the review tool (i.e. Acrobat) because of 10 errors in the PDF implementation (not realistic in this day and age, but possible nonetheless). Those 10 errors are provided in a validation report to the reviewer for action. The reviewer doesn't care about the technical details of implementing PDF. They want a file that opens and is readable within Acrobat. All can agree that the reviewer should not be burdened with evaluating and managing standards non-conformance issues.

If you replace study report with study data, and PDF with SDTM, this scenario is exactly what is happening today. But somehow that practice remains acceptable. Why? Because there are other "tools" (akin to simple text editors in the document world) that allow reviewers to work with non-conformant data, albeit at much reduced efficiency. These "workarounds" for non-standard study data are all too prevalent and accepted. With time this needs to change, to take full advantage of standardized data for Sponsor and FDA alike.

My future state for standardized study data submissions looks like this: study data arrive and undergo standards conformance validation using pass/fail criteria. Those that pass go to the reviewer and the regulatory review clock starts. Those that fail are returned for correction. (The conformance rules are publicly available so that conformance errors can be identified and corrected before submission.) During the filing period, automated data quality checks are performed and that report goes to the reviewer. Deficiencies result in possible information requests. Serious data quality deficiencies may form the basis of a refuse-to-file action.

Finally, let's retire the term "Data Fitness" in favor of what we really mean: Standards Conformance or Data Quality. Let's not muddle these two important issues any longer.


2017-06-26

The Semantic Web Way of Thinking

Before I knew anything about the Resource Description Framework (RDF) and the Semantic Web, I would say that I had a traditional way of thinking about the world. Take a clinical trial, for instance. First you have a research idea or hypothesis, then you design a trial to test that hypothesis, then you write the protocol, recruit subjects, conduct the trial, analyze the data, and reach conclusions. All of these steps are important, but I failed to see any fundamental commonality among them. For example, writing the protocol and analyzing the data are very different processes needing very different skill sets. How could they possibly be similar?

Enter the RDF and suddenly everything is related in some way to everything else. It may be obvious, but no less profound, to notice that everything in the world is a type of "Thing." The way we come to understand the world via the scientific process is to classify Things: group Things together and separate Things into different buckets or classes based on their properties. Here's an obvious example, written in pseudo-RDF-Turtle syntax.

A Car is a subClassOf Thing.
A Red Car is a subClassOf Car.

How does a computer know a car is a red car? Well, one can define a property of Car called Color and one of the options for Color is Red.

A Red Car is [any Car with Color value = Red].

So I can ask the computer to find every Red Car and it knows to look for those whose Color property is Red. This is very straightforward, almost to the point of being insultingly simple. But wait...
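In real (rather than pseudo) Turtle, the Red Car definition might be written as an OWL restriction class like this; the namespace and names are illustrative:

```turtle
@prefix :     <http://example.org/cars#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Car    a owl:Class ;
        rdfs:subClassOf owl:Thing .

# A RedCar is equivalent to: any Car whose Color property has the value Red
:RedCar a owl:Class ;
        owl:equivalentClass [
            a owl:Class ;
            owl:intersectionOf ( :Car
                                 [ a owl:Restriction ;
                                   owl:onProperty :hasColor ;
                                   owl:hasValue   :Red ] )
        ] .
```

With this in place, asserting that :MyCar is a :Car with :hasColor :Red is enough for a reasoner to classify it as a :RedCar.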

To make scientific discoveries, we first identify what properties are important for the type of Thing we are studying, and we measure those properties. Let's say you have an investigational drug A and you want to know if it's effective for Multiple Sclerosis. You have the following assertions.

DrugA is a subClassOf Thing.
EffectiveDrug is a subClassOf Drug.

How can a computer discover that DrugA is an EffectiveDrug?  In the same way as the Red Car example, there are properties of DrugA that semantic web tools can analyze to determine that the drug is an EffectiveDrug. Sometimes those properties are difficult to define, or difficult to measure, but the principle is the same.

So, getting back to the different steps in the lifecycle of a study: they are also Things, which we can call Activities. There are rules that determine when Activities begin and end, and rules that determine which Activity is performed next. One can define a property of Activities called State or Status (e.g. not yet started, ongoing, completed, aborted). So the lifecycle of a study breaks down into a series of Activities, each with its own properties: hasState, hasStartRule, hasEndRule. Suddenly processes that look very different now look very similar.
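A minimal Turtle sketch of that uniformity, with made-up activity and rule names:

```turtle
@prefix : <http://example.org/study#> .

# Two very different study steps, described with the same three properties
:ProtocolAuthoring1  a :Activity ;
    :hasState     :activitystatus_CO ;     # completed
    :hasStartRule :Rule_DEFAULT ;
    :hasEndRule   :ProtocolApprovedRule .

:DataAnalysis1  a :Activity ;
    :hasState     :activitystatus_NS ;     # not yet started
    :hasStartRule :DatabaseLockRule ;
    :hasEndRule   :AnalysisCompleteRule .
```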

This is the semantic web way of thinking. The universe is made up of things: similar things and different things. All are grouped together and distinguished from one another by their properties. The challenge is identifying those properties that matter, documenting them, and using semantic web tools to do the grouping and sorting for us. This is how the Semantic Web can work for us and help us make new discoveries.




2017-05-18

Common Clinical Terms expressed in OWL

One of the goals in establishing precise Aristotelian definitions for common clinical terms is to make them computable, i.e. express them in such a way that computers and information systems can reason across data and "understand" that Thing123 is a Medical Condition and Thing456 is a Symptom and can begin to infer new medical knowledge for us. The Web Ontology Language, OWL, is ideally suited for this task. I'm not an OWL expert, but I think it would be useful to explore what some of these terms look like in OWL and consider the implications of computable definitions.

How does computer-assisted reasoning (also called inferencing) work? A simple example is the property :subClassOf. The colon here is merely to remind me that this is a resource expressed somewhere on the world-wide-web; I have left out the namespace for simplicity. If you state that :Apple is a :subClassOf :Fruit and :McIntosh is a :subClassOf :Apple, then a computer can infer that :McIntosh is a :subClassOf :Fruit (new knowledge). This is trivial inferencing of course, but OWL supports much more sophisticated inferencing capabilities, which I have only begun to appreciate.

One very important class in OWL is the owl:Restriction class, a subclass of owl:Class. A restriction class is one whose membership is restricted based on certain properties that the individual members have. I wrote about Restriction classes in a previous post. We can use this class to make certain definitions computable. Without restriction classes, we have to manually assign members to a class. For example, the :Dog class has no meaning to a computer. We have to manually assign Fido, Spot, Rocket, and Buddy to the :Dog class. :Fido is a :Dog is true only because someone said it's true. But restriction classes create, in effect, rules that say only Things with certain features/properties are :Dogs. Once we establish these computable definitions, computers can do the assignments for us.

Let's look at an example from the clinical research domain: Persons who participate in a trial. Consider this taxonomy (i.e. superclass/subclass hierarchy). As one reads this, a member of a lower class is automatically a member of the next higher class, and so forth all the way to the top class. This is because rdfs:subClassOf is a "transitive" property.

--BiologicEntity
    --Person
        --HumanStudySubject
            --EligibleSubject
            --EnrolledSubject

And now some working definitions (taken from various sources and documented here)

BiologicEntity
Any individual living (or previously living) Entity.
(i.e. an :Entity that is living or was previously living) 

Person
A human being. A BiologicEntity that has species = homo sapiens. 

HumanStudySubject
A :Person that undergoes/is subjected to Study-specified activities as described in the Study Protocol

And then it can follow that: 

EligibleSubject
A HumanStudySubject who satisfies all Study-specific Eligibility Criteria.

Now assume that every BiologicEntity has a property called :species, and only Persons have a :species property value = homo sapiens. I can express the definition of a Person in OWL as follows:

:Person  a  owl:Class ;
        owl:equivalentClass  [ a  owl:Restriction ;
                               owl:onProperty  :species ;
                               owl:hasValue  species:homo_sapiens ] .

Now let's define a class called :StudyActivity, containing any protocol-specified activity belonging to a specific human study. We now define the following

study:HumanStudySubject  a  owl:Class ;
        owl:equivalentClass  [ a  owl:Restriction ;
                               owl:onProperty  study:participatesIn ;
                               owl:someValuesFrom  study:StudyActivity ] .

This says a HumanStudySubject is any Thing that participates in some StudyActivity.
So wherever there is an RDF triple that says :Person1 :participatesIn :StudyActivity_104, that :Person1 is automatically inferred to be a :HumanStudySubject.

This approach is not without some notable pitfalls. One's logic must be squeaky clean. Take this example: A Dog has Four Legs. Now if we mistakenly convert that to mean a Dog is any Thing with Four Legs (which clearly is wrong in English), you get the following OWL expression:

:Dog  a  owl:Class ;
        owl:equivalentClass  [ a  owl:Restriction ;
                               owl:onProperty  :hasLegs ;
                               owl:hasValue  "4"^^xsd:int ] .

So somewhere in a database is the triple:  :Morris :hasLegs "4"^^xsd:int .

Well, guess what: Morris is a Cat. But based on the OWL definition of a Dog, an information system will conclude that Morris is a Dog. So one must be careful which properties one selects to define membership in a restriction class.

The more I think about this approach, the more it makes sense. How do we currently distinguish two Things with the same name, e.g. Mustang (the car vs. the horse)? Easy: by the properties that each Thing has. One is a biological entity, the other is a machine; one has 4 legs and a tail, the other has an engine and 4 wheels. It makes sense to define members of a class by describing the properties that each member must have. This principle of making definitions of clinical terms computable is a key component of less ambiguous clinical trial data. Using this same approach, we can enable computers to identify members of other useful and interesting restriction classes, such as :EligibleSubject, :EffectiveDrug, :DangerousDrug, :PoorQualityDrug.

The possibilities can be very exciting.






2017-05-15

Activity Rules in Clinical Trials

I remain very interested in modeling Activities in clinical trials. Activities are performed according to Rules that are specified in the Protocol. I am exploring how best to model these rules. Here is one thought. All Activities have a Start Rule. Rules are Analyses in our mini-study ontology, since one Analyzes the data from other Activities ("Prerequisite Activities") to determine if and when the target Activity can take place.

Let's start with the easiest Activity (from a modeling perspective): Obtaining Informed Consent. It has a Start Rule that says "begin at any time." The Rule is satisfied by default, so it has a RuleOutcome always set to TRUE. There are no preconditions other than a Subject's willingness to participate in this Activity. The Activity completes with an ActivityOutcome = InformedConsent_GRANTED. The ActivityStatus is now Complete. The RDF looks something like this:

:Person1 :participatesIn :InformedConsent1 .
:InformedConsent1 :hasStartRule :Rule_DEFAULT .
:Rule_DEFAULT :activityOutcome "TRUE"^^xsd:string .
:InformedConsent1 :hasPerformedDate "2017-04-01"^^xsd:date ;   # the date the activity was performed
           :activityStatus :activitystatus_CO ;                # the Activity is completed
           :activityOutcome :informedConsent_GRANTED .

Let's assume it's a simple trial that has only three screening activities: DemographicDataCollection, FastingBloodSugar, and a serum PregnancyTest (if female).

We create a new Rule which says: RuleOutcome is TRUE if the prerequisite activity is complete and has a certain outcome. Generically, it looks like this:

:PrerequisiteOutcomeRule rdfs:subClassOf :Rule ;
          :hasPrerequisite :Activity ;
          :hasPrerequisiteStatus :activitystatus_CO ;      # the prerequisite activity must be complete
          :hasPrerequisiteOutcome :ActivityOutcome .       # the prerequisite activity must have a certain outcome

(There are optional properties one can consider here, like :delay, which specifies a duration that one must wait before the target activity can begin after the prerequisite activity ends.)

The instance data looks like this:

:Person1 :participatesIn :DemographicDataCollection1 .
:DemographicDataCollection1 :hasStartRule :PrerequisiteOutcomeRule1 .
:PrerequisiteOutcomeRule1 :hasPrerequisite :InformedConsent1 ;
          :hasPrerequisiteStatus :activitystatus_CO ;
          :hasPrerequisiteOutcome :informedConsent_GRANTED .

Once this information is recorded, and the rule satisfied, the Activity becomes a PlannedActivity.

An embedded spin:rule can check the PrerequisiteActivity and return a value of TRUE if the conditions are met:

 :PrerequisiteOutcomeRule1 :activityOutcome "TRUE"^^xsd:string .
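For the curious, such a spin:rule is essentially a SPARQL CONSTRUCT query attached to the rule class, with ?this bound to each instance. A hedged sketch of what it might look like (the actual rule may differ):

```sparql
PREFIX :    <http://example.org/study#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?this :activityOutcome "TRUE"^^xsd:string .
}
WHERE {
  ?this :hasPrerequisite        ?prereq ;
        :hasPrerequisiteStatus  ?status ;
        :hasPrerequisiteOutcome ?outcome .
  ?prereq :activityStatus  ?status ;       # the prerequisite is complete
          :activityOutcome ?outcome .      # and has the required outcome
}
```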

Once the RuleOutcome is evaluated as TRUE, another spin:rule can set the scheduled date of the DemographicDataCollection equal to the date the InformedConsent was done (plus any delay specified in the Rule). The Activity now becomes a ScheduledActivity.

The same Rule can be applied to the FastingBloodSugar activity.

PregnancyTest requires a new rule. It can begin when the DemographicDataCollection Activity is Complete and the Outcome is :sex_Female.

Now the very cool part. These same rules can be used to check for Eligibility. Why? Because an eligibility criterion is nothing more than a Start Rule which defines when the RandomizationActivity can begin.

Similarly, we can define rules for when Visits, Elements, and Epochs can begin. The Study now becomes a graph of StudyActivities, all waiting to begin, but only those whose Rules have RuleOutcome=TRUE are ready to go next. I think this paradigm will hold for even the most complex adaptive designs. It is something worth testing.

Finally, an existential question. Is ObtainingInformedConsent a StudyActivity?  If a HumanSubject (a Person of Interest) participatesIn the ObtainInformedConsent Activity, but does NOT grant informed consent, is he/she considered to have participated in the Study? If so, what would their disposition be? Screen Failure doesn't seem right since no screening tests were conducted. Is ObtainingInformedConsent a ScreeningActivity? If so, then failure to obtain informed consent could be considered a screen failure.

Please share your thoughts, especially if you have other ideas on how to model activity rules.

2017-03-26

Activities in Clinical Trials, part 2

This post is a sequel to one I recently posted about Activities in Clinical Trials. As I mentioned in that post, clinical trials are fundamentally a group of many Activities and the rules that describe when they are performed, grouped, and analyzed. The recently launched PhUSE SDTM Data in RDF project is developing a mini Study Ontology to represent clinical data using RDF in a way that will make it easier, we think, to generate high-quality SDTM-compliant datasets. Our Study Ontology recognizes that all Activities have Outcomes; in the case of an Observation, the Outcome is the result. Today we examine the results of Observations in detail. The ontology needs to represent Observations and their Outcomes in a highly consistent and semantically precise manner, yet it should be flexible enough to accommodate all Observations.

So we have the basic premise in the Ontology that:

:Activity :hasOutcome :ActivityOutcome.

What does that look like for various Observations? When we look at results of Observations, we basically see two types: categorical results and numeric results. The categorical results can be controlled terminology; the numeric results have a value and often, but not always, a unit. There may also be a free text description of the result, but that's easy to add and we won't consider it further today. So now we have:

:ActivityOutcome :hasValue "<number>" .
:ActivityOutcome :hasUnit :Unit .
:ActivityOutcome :hasCodedTerm :ActivityOutcomeCode .

Also let's acknowledge that Observations may have components, or SubActivities:
:Activity :hasSubActivity :Activity .

The ontology looks like this:


Here are some simple examples. The first is Age. (A more detailed model would link to the method for obtaining the age: is it collected by asking the subject his/her age, or derived from the birthdate?)



The next example is a lab test, RPR test for syphilis.


The next one is BP, the most complex, as it has two SubActivities: SBP and DBP.
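In lieu of the diagram, here is a sketch of what the BP instance data might look like in Turtle; the names and values are illustrative, not the project's actual ontology:

```turtle
@prefix :    <http://example.org/study#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A BP observation with two SubActivities, each carrying its own outcome
:BloodPressure1 a :Observation ;
    :hasSubActivity :SBP1 , :DBP1 .

:SBP1 :hasOutcome [ :hasValue "120"^^xsd:int ; :hasUnit :mmHg ] .
:DBP1 :hasOutcome [ :hasValue "80"^^xsd:int  ; :hasUnit :mmHg ] .
```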



The big question is: will this work for the vast majority of observations out there? I'm not sure, but it is worth testing. I'm optimistic that it will handle most. I'm intentionally leaving out important details like data types, methods, and provenance information. I think these can all be addressed relatively easily.

It is worth noting that many "tests" contain both observations and an assessment of those observations by a qualified professional. Histopathology and Radiology tests are the most common examples. The report is divided into two sections as a result: the first describes the findings, and the second describes the Assessment (i.e. interpretation) of those findings, often resulting in a diagnosis and/or further characterization of an existing medical condition. Representation of Assessment information in the ontology is a discussion left for another day.

In an upcoming blog post, I will discuss the Rules that determine when Activities are conducted. These include Eligibility Criteria.

Thank you for your comments.