Managing Study Workflow using the Resource Description Framework (RDF)

Imagine a study conduct tool where:

  1. You enter the result of an observation and the tool immediately tells you what observation(s) to collect next.
  2. The tool immediately performs a data quality check and raises an alert if there is a problem.
  3. You enter the results of all the screening tests and the tool immediately tells you whether the subject is eligible to continue in the trial.
  4. If the subject is not eligible, it tells you why he or she failed screening.
  5. In an adaptive trial design, it analyzes the observations in real time and tells you which protocol-specified modifications you can make to the subject's treatment plan.

These are all possible today if we start representing study data, including the protocol, using the RDF.

In my last post, I spoke about a paper on this topic. I am now sharing that paper so those interested can take a look at it. I'm also posting the slides I presented at the annual PhUSE meeting in Edinburgh, Scotland (Oct 8-11, 2017), which was a very successful meeting by the way. A record number of attendees (695) participated in the conference.

I hope you will find it interesting. Please send me your comments. Thank you.

Paper: Managing Study Workflow using the Resource Description Framework

Slides: Managing Study Workflow


Eligibility Criteria, Screen Failures, and another RDF Success Story

It's T minus 6 weeks (approximately) to the PhUSE 13th Annual Conference in Edinburgh, Scotland, and I'm beaming with excitement. I'm involved in two study data projects using RDF and both are going very well. The first one, in collaboration with Tim Williams and an enthusiastic project team of PhUSE volunteers, is called Clinical Trials Data in RDF. Among its various goals, it will demonstrate how study data in RDF can be used to automatically generate highly standards-conformant, submission-quality SDTM datasets.

But it's the second one that I want to discuss today. It's a paper called "Managing Study Workflow Using the RDF." The paper is in pre-publication status so I can't share it today, but I plan to post a copy here after the conference. I include the Abstract below.

In a nutshell, the paper describes how one can represent study activity start rules using SPIN (SPARQL Inferencing Notation), an RDF-based notation, to identify which study activity(-ies) are to be performed next based on what activities have already been performed. Well, it turns out that the start rule for the Randomization activity in a typical double-blind trial is in fact the Eligibility Criteria for the study. Here it is, in an executable form that, when combined with a standard COTS (commercial off-the-shelf) RDF inferencing engine, can automatically determine eligibility. How cool is that?

A typical eligibility rule consists of multiple subrules all of which themselves must be TRUE for the overall rule to be true (e.g. AGE must be >=18 years AND RPR must be negative AND Pregnancy Test must be negative AND etc.); exclusion criteria can be negated and added as a subrule. The ontology also describes how to skip subrules that can logically be skipped (e.g. Pregnancy Test must be negative in a male subject). The end result is that identifying an Eligible Subject is automatic and performed simply by entering the screening test results in the knowledgebase. (Think of a knowledgebase as an RDF database).

Without going into the details (wait for the paper!), the rule points to all the screening activities that matter, checks each one for the expected outcome/result, and returns TRUE if the conditions of the rule are met and FALSE if they are not. If the rule outcome is TRUE, the subject is eligible and the Randomization activity is enabled. If the rule outcome is FALSE, the subject is not eligible and Randomization remains disabled. The paper describes the data from eight hypothetical subjects who were screened for a hypothetical study with just a few screening tests/criteria. The ontology produced the correct eligibility outcome for all eight.

But there is more. By adding a few more simple SPIN rules to the ontology, the inferencing engine can readily provide a list of all Screen Failures and the tests that caused them to fail. It can also identify the tests that were logically skipped and therefore ignored for eligibility testing purposes. Do you want to determine which Screen Failure subjects received study medication? Another SPIN rule can do that too. The possibilities are quite exciting. It makes RDF, in my humble opinion, a strong candidate for representing clinical trial data during study conduct. No other standard that I know of supports this kind of automation "out of the box." In RDF, the model and the implementation of the model are the same! And, once one is ready to submit, you press another button and submission-quality SDTM datasets are generated (which the first project I mentioned intends to demonstrate).
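As a rough illustration of the logic only (these are not the SPIN rules from the paper), here is a minimal Python sketch of an eligibility rule built from subrules. The three subrules mirror the examples given above; the subject records, field names, and function name are hypothetical.

```python
# Hypothetical sketch: an eligibility rule expressed as a list of subrules.
# Each subrule is (name, applies, passes); all field names are assumptions.
SUBRULES = [
    ("AGE >= 18 years", lambda s: True, lambda s: s["age"] >= 18),
    ("RPR negative", lambda s: True, lambda s: s["rpr"] == "NEGATIVE"),
    # This subrule is logically skipped for male subjects.
    ("Pregnancy test negative",
     lambda s: s["sex"] == "F",
     lambda s: s.get("pregnancy_test") == "NEGATIVE"),
]


def evaluate_eligibility(subject):
    """Return the overall outcome plus the failed and skipped subrules."""
    failed, skipped = [], []
    for name, applies, passes in SUBRULES:
        if not applies(subject):
            skipped.append(name)   # ignored for eligibility purposes
        elif not passes(subject):
            failed.append(name)    # a reason for screen failure
    return {"eligible": not failed, "failed": failed, "skipped": skipped}


subject_1 = {"age": 34, "sex": "M", "rpr": "NEGATIVE"}
subject_2 = {"age": 17, "sex": "F", "rpr": "NEGATIVE", "pregnancy_test": "POSITIVE"}

print(evaluate_eligibility(subject_1))
print(evaluate_eligibility(subject_2))
```

The same pass over the subrules yields the eligibility outcome, the reasons for a screen failure, and the tests that were skipped, which is the behavior the additional SPIN rules are described as providing.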

For more details, contact me, or wait until after the PhUSE meeting in October for the full paper.

A clinical study is fundamentally a collection of activities that are performed according to protocol-specified rules. It is not unusual for a single subject to undergo hundreds of study-related activities. Managing this workflow is a formidable challenge. The investigator must ensure that all activities are conducted at the right time and in the correct sequence, sometimes skipping activities that logically need not be done. It is not surprising that errors occur.

This paper explores the use of the Resource Description Framework (RDF) and related standards to automate the management of a study workflow. It describes how protocol information can be expressed in the RDF in a computable way, such that an information system can easily identify which activities have been performed, determine which activities should be performed next, and which can be logically skipped. The use of this approach has the potential to improve how studies are conducted, resulting in better compliance and better data.


Quality Data in Clinical Trials, Part 2

It's been two years since I wrote about quality data in clinical trials. As I re-read that post now, I agree with most of what I said, but it's time to update my thinking based on experience gained since then with study data validation processes.

I made the point that there are two types of validation rules: conformance rules (to data standards) and business rules, a.k.a. data quality checks. I had suggested that conformance rules are best managed by the standards development organization. The fact is that sponsors and FDA support multiple standards (SDTM, MedDRA, CDISC Terminology, WHO Drug Dictionary) so it's up to FDA to manage the collective set of conformance rules across the broad data standards landscape with regard to regulatory study data submissions.

The division between conformance rules and business rules is still quite important. They serve different functions. Ensuring conformance to standards enables automation. Ensuring data quality enables meeting the study objectives. One can assess data quality on legacy data, but it is a slow, manual process. Standardized data enable automated data quality checks that can more easily uncover serious data content issues that can impede analysis and review.

As a former FDA reviewer, and a big proponent of data standards, I can honestly say that FDA reviewers care very little about data standards issues. Their overriding concern is that the data be sufficiently standardized so they can run standard automated analyses on the data. The analyses drive the level of standardization needed by the organization. These analyses include the automated data quality checks. One cannot determine whether AGE < 0 (a data quality check) if AGE is called something else or is located in the wrong domain (a conformance issue).

It's like driving a car. You want to get from point A to point B quickly (minimize data quality issues); you don't really care what's under the hood (standards conformance issues). That is for mechanics (or data analysts) to worry about.

FDA now has a robust service to assess study "Data Fitness" (described as data that are fit for use). Data Fitness combines both conformance and business rules. They are not split, so the reviewer is left to deal with two kinds of issues at once: data conformance issues, which they care little about because these can be quite technical and there is generally a manual work-around, and data quality issues, which matter most to them and have the biggest impact on the review. Combining the two is a mistake. I believe Data Fitness as a concept should be retired and the service split into two: Standards Conformance, and Data Quality. The Data Quality assessment service should only be performed on data that have passed the minimum level of conformance needed by the organization. If a study fails conformance testing, it wasn't standardized properly and those errors need to be corrected. In the new era requiring the use of data standards, FDA reviewers should not be burdened with data that do not pass a minimum level of data standards conformance.

Consider this hypothetical scenario as an example to drive home my point. FDA requires sponsors to submit a study report supporting the safety and effectiveness of a drug. The report should be submitted digitally using the PDF standard.  The report arrives and the file cannot be opened using the review tool (i.e. Acrobat) because of 10 errors in PDF implementation (not realistic in today's day and age, but possible nonetheless). Those 10 errors are provided in a validation report to the reviewer for action. The reviewer doesn't care about the technical details of implementing PDF. They want a file that opens and is readable within Acrobat. Let us all agree that the reviewer should not be burdened evaluating and managing standards non-conformance issues.

If you replace study report with study data, and PDF with SDTM, this scenario is exactly what is happening today. But somehow that practice remains acceptable. Why? Well, because there are other "tools" (akin to simple text editors in the document world) that allow reviewers to work with non-conformant data, albeit at much reduced efficiency. These "workarounds" for non-standard study data are all too prevalent and accepted. With time this needs to change so that Sponsors and FDA alike can take full advantage of standardized data.

My future state for standardized study data submissions looks like this: study data arrive and undergo standards conformance validation using pass/fail criteria. Those that pass go to the reviewer and the regulatory review clock starts. Those that fail are returned for correction. (The conformance rules are publicly available so that conformance errors can be identified and corrected before submission.) During the filing period, automated data quality checks are performed and that report goes to the reviewer. Deficiencies result in possible information requests. Serious data quality deficiencies may form the basis of a refuse to file action.
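To make the two-stage idea concrete, here is a minimal Python sketch of a conformance gate followed by data quality checks. It is only an illustration: the required-variable set, the record layout, and the function names are assumptions, not actual FDA rules or tooling.

```python
# Hypothetical sketch of the two-stage flow: conformance gate, then quality checks.
REQUIRED_VARIABLES = {"USUBJID", "AGE", "SEX"}  # assumed minimal set, for illustration


def conformance_errors(records):
    """Pass/fail stage: report records that lack a required variable."""
    errors = []
    for index, record in enumerate(records):
        missing = sorted(REQUIRED_VARIABLES - set(record))
        if missing:
            errors.append(f"record {index}: missing {', '.join(missing)}")
    return errors


def quality_findings(records):
    """Quality stage: only meaningful once AGE is known to be present."""
    return [f"{r['USUBJID']}: AGE < 0" for r in records if r["AGE"] < 0]


def process_submission(records):
    """Run quality checks only on data that passed conformance."""
    errors = conformance_errors(records)
    if errors:
        return {"status": "returned for correction", "conformance_errors": errors}
    return {"status": "sent to reviewer", "quality_findings": quality_findings(records)}


conformant = [
    {"USUBJID": "S1", "AGE": 34, "SEX": "M"},
    {"USUBJID": "S2", "AGE": -1, "SEX": "F"},
]
non_conformant = [{"USUBJID": "S3", "SEX": "F"}]

print(process_submission(conformant))
print(process_submission(non_conformant))
```

The point of the design is that the reviewer only ever sees the second report; conformance failures never reach them.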

Finally, let's retire use of the term "DataFit" in favor of what we really mean: Standards Conformance or Data Quality. Let's not muddle these two important issues any longer.


The Semantic Web Way of Thinking

Before I knew anything about the Resource Description Framework (RDF) and the Semantic Web, I would say that I had a traditional way of thinking about the world. Take a clinical trial for instance. First you have a research idea or hypothesis, then you design a trial to test that hypothesis, then you write the protocol, recruit subjects, conduct the trial, analyze the data, and reach conclusions. All of these steps are important but I failed to see any commonality in any fundamental way. For example, writing the protocol and analyzing the data are very different processes needing very different skill sets. How could they possibly be similar?

Enter the RDF and suddenly everything is related in some way with everything else. It may be obvious but no less profound to notice that everything in the world is a type of "Thing." The way we come to understand the world via the scientific process is to classify Things, group Things, separate Things into different buckets or classes based on their properties. Here's an obvious example, written in pseudo-RDF-turtle syntax.

A Car is a subClassOf Thing.
A Red Car is a subClassOf Car.

How does a computer know a car is a red car? Well, one can define a property of Car called Color and one of the options for Color is Red.

A Red Car is [any Car with Color value = Red].

So I can ask the computer to find every Red Car and it knows to look for those whose Color property is Red. This is very straightforward, almost to the point of being insultingly simple. But wait...

To make scientific discoveries, we first identify what properties are important for the type of Thing we are studying, and we measure those properties. Let's say you have an investigational drug A and you want to know if it's effective for Multiple Sclerosis. You have the following assertions.

DrugA is a subClassOf Thing.
EffectiveDrug is a subClassOf Drug.

How can a computer discover that DrugA is an EffectiveDrug?  In the same way as the Red Car example, there are properties of DrugA that semantic web tools can analyze to determine that the drug is an EffectiveDrug. Sometimes those properties are difficult to define, or difficult to measure, but the principle is the same.

So, getting back to the different steps in the lifecycle of a study, they are also Things that we can call Activities. There are rules that determine when Activities begin and end, and rules that determine which Activity is performed next. One can define a property of Activities called State or Status (e.g. not yet started, ongoing, completed, aborted). So the lifecycle of a study is broken down into a series of activities, each with its own properties: hasState, hasStartRule, hasEndRule. Suddenly processes that look very different now look very similar.

This is the semantic web way of thinking. The universe is made up of things: similar things and different things. All are grouped together and distinguished from one another by their properties. The challenge is identifying those properties that matter, documenting them, and using semantic web tools to do the grouping and sorting for us. This is how the Semantic Web can work for us and help us make new discoveries.


Common Clinical Terms expressed in OWL

One of the goals in establishing precise Aristotelian definitions for common clinical terms is to make them computable, i.e. express them in such a way that computers and information systems can reason across data and "understand" that Thing123 is a Medical Condition and Thing456 is a Symptom and can begin to infer new medical knowledge for us. The Web Ontology Language, OWL, is ideally suited for this task. I'm not an OWL expert, but I think it would be useful to explore what some of these terms look like in OWL and consider the implications of computable definitions.

How does computer-assisted reasoning (also called inferencing) work? A simple example is the property :subClass. The colon here is merely to remind me that this is a resource expressed somewhere on the world-wide-web. I have left out the namespace for simplicity. If you state that :Apple is a :subClass of :Fruit and :McIntosh is a :subClass of :Apple, then a computer can infer that :McIntosh is a :subClass of :Fruit (new knowledge). This is trivial inferencing of course, but OWL supports much more sophisticated inferencing capabilities, which I have only begun to appreciate.

One very important class in OWL is the owl:Restriction class, a sub-class of owl:Class. A restriction class is one whose membership is restricted based on certain properties that the individual member has. I wrote about Restriction classes in a previous post. We can use this class to make certain definitions computable. Without restriction classes, we have to manually assign members to a class. For example, the :Dog class has no meaning to a computer. We have to manually assign Fido, Spot, Rocket, and Buddy to the :Dog class. ":Fido is a :Dog" is true only because someone said it's true. But restriction classes in effect create rules that say only Things with certain features/properties are :Dogs. Once we establish these computable definitions, computers can do the assignments for us.
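The subclass inference described above can be sketched in a few lines of Python. This is only a toy stand-in for what an OWL reasoner does; the dictionary layout and function name are assumptions made for illustration.

```python
# Toy illustration of subclass inferencing; not an OWL reasoner.
# Maps each class to the classes it is directly declared a subclass of.
SUBCLASS_OF = {
    ":McIntosh": {":Apple"},
    ":Apple": {":Fruit"},
}


def infer_superclasses(cls, subclass_of):
    """Return every class reachable from cls by following subclass links."""
    found, stack = set(), [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# :Fruit is never stated for :McIntosh directly; it is inferred via :Apple.
print(sorted(infer_superclasses(":McIntosh", SUBCLASS_OF)))
```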

Let's look at an example from the clinical research domain: Persons that participate in a trial. Consider this taxonomy (i.e. superclass/subclass hierarchy). As one reads this, a member of a lower class is automatically a member of the next higher class, and so forth all the way to the top class. This makes the rdfs:subClassOf property a "transitive" property.

    -- Person

And now some working definitions (taken from various sources and documented here):

BiologicEntity: Any individual living (or previously living) Entity
(i.e. an :Entity that is living or was previously living).

Person: A human being. A BiologicEntity that has species = homo sapiens.

HumanStudySubject: A :Person that undergoes/is subjected to Study-specified activities as described in the Study Protocol.

And then it can follow that:

EligibleSubject: A HumanStudySubject who satisfies all Study-specific Eligibility Criteria.

Now assume that every BiologicEntity has a property called :species and only Persons have a :species property value = homo sapiens. I can express the definition of a Person as the following in OWL:

:Person  a      owl:Class ;
        owl:equivalentClass  [ a                  owl:Restriction ;
                               owl:onProperty     :species ;
                               owl:hasValue       species:homo_sapiens ] .

Now let's define a class called :StudyActivity, containing any protocol-specified activity belonging to a specific human study. We now define the following:

study:HumanStudySubject  a      owl:Class ;
        owl:equivalentClass  [ a                   owl:Restriction ;
                               owl:onProperty      study:participatesIn ;
                               owl:someValuesFrom  study:StudyActivity ] .

This says a HumanStudySubject is any Thing that participates in some StudyActivity.
So wherever there is an RDF triple such as :Person1 study:participatesIn :StudyActivity_104 (where :StudyActivity_104 is a study:StudyActivity), :Person1 is automatically inferred to be a study:HumanStudySubject.
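As a rough sketch of what that inference amounts to, the Python below applies the "participates in some StudyActivity" test to a small list of triples. It imitates the effect of the restriction class rather than using an OWL reasoner; the triples (including :Person2 and :Picnic1) and the function name are hypothetical.

```python
# Toy imitation of the someValuesFrom restriction; not an OWL reasoner.
TRIPLES = [
    (":StudyActivity_104", "rdf:type", "study:StudyActivity"),
    (":Person1", "study:participatesIn", ":StudyActivity_104"),
    # :Picnic1 is not typed as a StudyActivity, so :Person2 should not qualify.
    (":Person2", "study:participatesIn", ":Picnic1"),
]


def infer_human_study_subjects(triples):
    """Return subjects that participate in at least one study:StudyActivity."""
    activities = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "study:StudyActivity"
    }
    return {
        s for s, p, o in triples
        if p == "study:participatesIn" and o in activities
    }


print(infer_human_study_subjects(TRIPLES))
```

Nobody asserted that :Person1 is a HumanStudySubject; membership follows from the participation triple alone.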

This approach is not without some notable pitfalls. One's logic must be squeaky clean. Take this example: A Dog has Four Legs. Now if we mistakenly convert that to mean a Dog is any Thing with Four Legs (which clearly is wrong in English), you get the following OWL expression:

:Dog  a      owl:Class ;
        owl:equivalentClass  [ a                  owl:Restriction ;
                               owl:onProperty     :hasLegs ;
                               owl:hasValue       "4"^^xsd:int ] .

So somewhere in a database is the triple:  :Morris :hasLegs "4"^^xsd:int .

Well, guess what, Morris is a Cat. But based on the OWL definition of a Dog, an information system will conclude that Morris is a Dog. So one must be careful which properties one selects to define membership in a restriction class.

The more I think about this approach, the more it makes sense. How do we currently distinguish two Things with the same name, e.g. Mustang (the car vs. the horse)? Easy. By the properties that each Thing has. One is a biological entity, the other is a machine; one has 4 legs and a tail, the other has an engine and 4 wheels. It makes sense to define members of a class by describing the properties that each member must have. This principle of making definitions of clinical terms computable is a key component of less ambiguous clinical trial data. Using this same approach, we can enable computers to identify members of other useful and interesting restriction classes, such as :EligibleSubject, :EffectiveDrug, :DangerousDrug, :PoorQualityDrug.

The possibilities can be very exciting.


Activity Rules in Clinical Trials

I remain very interested in modeling Activities in clinical trials. Activities are performed according to Rules that are specified in the Protocol. I am exploring how best to model these rules. Here is one thought. All Activities have a Start Rule. Rules are Analyses in our mini-study ontology, since one Analyzes the data from other Activities ("Prerequisite Activities") to determine if and when the target Activity can take place.

Let's start with the easiest Activity (from a modeling perspective): Obtaining Informed Consent. It has a Start Rule that says "begin at any time." The Rule is automatically satisfied by default so has a RuleOutcome always set to TRUE.   There are no preconditions other than a Subject's willingness to participate in this Activity. The Activity completes with an ActivityOutcome = InformedConsent_GRANTED.  The ActivityStatus is now Complete. The RDF looks something like this:

:Person1 :participatesIn :InformedConsent1 .
:InformedConsent1 :hasStartRule :Rule_DEFAULT .
:Rule_DEFAULT :activityOutcome "TRUE"^^xsd:string .
:InformedConsent1 :hasPerformedDate "2017-04-01"^^xsd:date ;   # the date the activity was performed
           :activityStatus :activitystatus_CO ;        # the Activity is completed
           :activityOutcome :InformedConsent_GRANTED .

Let's assume it's a simple trial that has only three screening activities: DemographicDataCollection, FastingBloodSugar, and a serum PregnancyTest (if female).

We create a new Rule which says: RuleOutcome is TRUE if the prerequisite activity is complete and has a certain outcome. Generically it looks like this.

:PrerequisiteOutcomeRule  rdfs:subClassOf :Rule ;
          :hasPrerequisite    :Activity ;
          :hasPrerequisiteStatus  :activitystatus_CO ;    # the prerequisite activity must be complete
          :hasPrerequisiteOutcome :ActivityOutcome .  # the prerequisite activity must have a certain outcome
(There are optional properties one can consider here, like :delay, which specifies a duration that one must wait before the target activity can begin after the prerequisite activity ends.)

The instance data looks like this:

:Person1 :participatesIn :DemographicDataCollection1 .
:DemographicDataCollection1 :hasStartRule  :PrerequisiteOutcomeRule1 .
:PrerequisiteOutcomeRule1 :hasPrerequisite :InformedConsent1 ;
          :hasPrerequisiteStatus :activitystatus_CO ;
          :hasPrerequisiteOutcome :InformedConsent_GRANTED .

Once this information is recorded, and the rule satisfied, the Activity becomes a PlannedActivity.

An embedded spin:rule can check the PrerequisiteActivity and return a value of TRUE if the conditions are met:

 :PrerequisiteOutcomeRule1 :activityOutcome "TRUE"^^xsd:string .

Once the RuleOutcome is evaluated as TRUE, another spin:rule can set the scheduled date of the DemographicDataCollection equal to the date the InformedConsent was done (plus any delay specified in the Rule). The Activity now becomes a ScheduledActivity.

The same Rule can be applied to the FastingBloodSugar activity.

PregnancyTest requires a new rule. It can begin when the DemographicDataCollection Activity is Complete and the Outcome is :sex_Female.
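The start rules described above can be sketched in Python as follows. This is a minimal illustration using an in-memory stand-in for the triples, not the embedded spin:rule itself; the dictionary layout, the numbered activity names (FastingBloodSugar1, PregnancyTest1), and the function names are assumptions.

```python
# Hypothetical in-memory stand-in for the activity triples (not actual RDF).
ACTIVITIES = {
    "InformedConsent1": {
        "status": "activitystatus_CO",
        "outcome": "InformedConsent_GRANTED",
    },
    "DemographicDataCollection1": {
        "status": "activitystatus_CO",
        "outcome": "sex_Female",
    },
}

# Each target activity has a start rule that names its prerequisite,
# the status that prerequisite must have, and the outcome it must have.
START_RULES = {
    "FastingBloodSugar1": {
        "prerequisite": "InformedConsent1",
        "status": "activitystatus_CO",
        "outcome": "InformedConsent_GRANTED",
    },
    "PregnancyTest1": {
        "prerequisite": "DemographicDataCollection1",
        "status": "activitystatus_CO",
        "outcome": "sex_Female",
    },
}


def rule_outcome(rule, activities):
    """TRUE only if the prerequisite is complete and has the expected outcome."""
    prerequisite = activities.get(rule["prerequisite"])
    if prerequisite is None:
        return False
    return (prerequisite["status"] == rule["status"]
            and prerequisite["outcome"] == rule["outcome"])


def ready_to_start(start_rules, activities):
    """List the activities whose start rule currently evaluates to TRUE."""
    return sorted(
        name for name, rule in start_rules.items()
        if rule_outcome(rule, activities)
    )


print(ready_to_start(START_RULES, ACTIVITIES))
```

If the DemographicDataCollection outcome were anything other than sex_Female, PregnancyTest1 would simply drop out of the list, which is how an activity gets logically skipped.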

Now the very cool part. These same rules can be used to check for Eligibility. Why? Because an eligibility criterion is nothing more than a Start Rule which defines when the RandomizationActivity can begin.

Similarly we can define rules when Visits, Elements, Epochs can begin. The Study now becomes a graph of StudyActivities, all waiting to begin, but only those that have Rules with RuleOutcome=TRUE are ready to go next. I think this paradigm will hold for even the most complex adaptive designs. It is something worth testing.

Finally, an existential question. Is ObtainingInformedConsent a StudyActivity? If a HumanSubject (a Person of Interest) participatesIn the ObtainingInformedConsent Activity, but does NOT grant informed consent, is he/she considered to have participated in the Study? If so, what would their disposition be? Screen Failure doesn't seem right since no screening tests were conducted. Is ObtainingInformedConsent a ScreeningActivity? If so, then failure to obtain informed consent could be considered a screen failure.

Please share your thoughts, especially if you have other ideas on how to model activity rules.


Activities in Clinical Trials, part 2

This post is a sequel to one I recently posted about Activities in Clinical Trials. As I mentioned in that post, clinical trials are fundamentally a group of many Activities and the rules that describe when they are performed, grouped, and analyzed. The recently launched PhUSE SDTM Data in RDF project is developing a mini Study Ontology to represent clinical data using RDF in a way that will make it easier, we think, to generate high quality SDTM-compliant datasets. Our Study Ontology recognizes that all Activities have Outcomes. In the case of an Observation, the Outcome is the result. Today we examine the results of Observations in detail. The ontology needs to represent Observations and their Outcomes in a highly consistent and semantically precise manner, yet it should be flexible enough to accommodate all Observations.

So we have the basic premise in the Ontology that:

:Activity :hasOutcome :ActivityOutcome.

What does that look like for various Observations? When we look at results of Observations, we basically see two types: categorical results and numeric results. The categorical results can be controlled terminology. The numeric results have a value and often, but not always, a unit. There may also be a free text description of the results, but that's easy to add and we won't consider it further today. So now we have:

:ActivityOutcome :hasValue "<number>" .
:ActivityOutcome :hasUnit :Unit .
:ActivityOutcome :hasCodedTerm :ActivityOutcomeCode .

Also let's acknowledge that Observations may have components, or SubActivities:
:Activity :hasSubActivity :Activity .

The ontology looks like this:

Here are some simple examples. The first is Age. (A more detailed model would link to the method for obtaining the age: is it collected by asking the subject his/her age, or is it derived from the birth date?)

The next example is a lab test, RPR test for syphilis.

The next one is BP, the most complex, as it has two sub activities: SBP and DBP. 

The big question is will this work for the vast majority of observations out there? I'm not sure but it is worth testing. I'm optimistic that it will handle most. I'm intentionally leaving out important details like data types, methods, provenance information. I think these can all be addressed relatively easily.

It is worth noting that many "tests" contain both observations and an assessment of those observations by a qualified professional. Histopathology and Radiology tests are the most common. The report is divided into two sections as a result. The first describes the findings, and the second describes the Assessment (i.e. interpretation) of those findings, often resulting in a diagnosis and/or further characterization of an existing medical condition. Representation of Assessment information in the ontology is a discussion left for another day.

In an upcoming post, I will discuss Rules that determine when Activities are conducted. These include Eligibility Criteria.

Thank you for your comments. 


Extensible Code Lists: an RDF Solution

We are all familiar with code lists, or value sets as they are also called. These are permissible values for a variable. For example Race, as described by the U.S. Office of Management and Budget (OMB), can have 5 permissible values: White, Black or African American, Asian, American Indian or Alaska Native, and Native Hawaiian or other Pacific Islander.

Some standard code lists are incomplete, i.e. they don't capture the universe of possible values for a variable. These code lists are called extensible. The Sponsor may create custom terms and add them to the code list. Managing these is a challenge. Here is an idea that we are proposing for the PhUSE SDTM Data in RDF project that we are kicking off next week at the PhUSE Computational Science Symposium.  RDF has a unique advantage over other solutions in that it is designed to work with data that are distributed across the web. It can be used to integrate multiple dictionaries from multiple sources. Here's one way it can work. 

First one creates a study terminology ontology containing all the standard terminology concepts needed for clinical trials. It looks something like this:

One can see how to leverage other terminologies. For example, the Vital Signs class links to SDTM terminology expressed in RDF. In this case the resources shown here are for Diastolic Blood Pressure.

Now you create a second ontology for custom terms, which looks very similar to the first one:

In this example, the sponsor defined three custom flags for subjects who completed 8, 16, and 24 weeks of treatment, respectively. These are entered as custom:PopulationFlag analyses. Next, one imports the standard terminology ontology and specifies, using the rdfs:subClassOf property, that the custom terms are sub-classes of the standard concepts. So now it looks like this:

Looking at the code:PopulationFlag example, there are three standard population flags specified: Efficacy (EFF), Safety (SAF), and Intent to Treat (ITT). Furthermore there are the three custom flags as previously described.

The nice thing about this approach is that the custom terms exist independently from the standard terms and can be easily removed/ignored for the next study, yet they can be linked in this way to the standard terms so tools treat them the same. A SPARQL query looking for all members of the code:PopulationFlag class will return 6 individuals. For the next study, one can create a different set of custom terms. The "web" of study terminologies begins to look like the figure below. One can imagine a diverse library of controlled terms all available for implementation almost literally at one's fingertips.
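To illustrate why such a query returns six individuals, here is a small Python sketch that mimics the subclass-aware lookup over a handful of triples. It is a stand-in for the SPARQL query, not the query itself; the individual names (code:EFF, custom:COMPLETED_8_WEEKS, and so on) and the function name are hypothetical.

```python
# Toy stand-in for a subclass-aware class membership query (not SPARQL).
TRIPLES = [
    # The custom class is declared a subclass of the standard one.
    ("custom:PopulationFlag", "rdfs:subClassOf", "code:PopulationFlag"),
    # Standard population flags (individual names are assumptions).
    ("code:EFF", "rdf:type", "code:PopulationFlag"),
    ("code:SAF", "rdf:type", "code:PopulationFlag"),
    ("code:ITT", "rdf:type", "code:PopulationFlag"),
    # Custom flags for this study (individual names are assumptions).
    ("custom:COMPLETED_8_WEEKS", "rdf:type", "custom:PopulationFlag"),
    ("custom:COMPLETED_16_WEEKS", "rdf:type", "custom:PopulationFlag"),
    ("custom:COMPLETED_24_WEEKS", "rdf:type", "custom:PopulationFlag"),
]


def members_of(cls, triples):
    """Return individuals typed to cls or to any of its (transitive) subclasses."""
    classes = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and o in classes and s not in classes:
                classes.add(s)
                changed = True
    return {s for s, p, o in triples if p == "rdf:type" and o in classes}


print(len(members_of("code:PopulationFlag", TRIPLES)))
```

Dropping the custom triples for the next study leaves the three standard flags untouched, which is the independence the approach is after.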

One can link to other terminologies in the same way. Ideally, all the standard ontologies exist on the web and one merely links to them, thereby taking advantage of Linked Data principles.

I appreciate your comments. 


Temporal Concepts in Clinical Trials

As the saying goes, "timing is everything." This is no less true in clinical trials because knowing when activities occurred or how long they last often holds the key to proper interpretation of the data. Documenting temporal elements for activities in clinical trials is therefore crucial.

In the RDF world, we can leverage the work of others who have thought about this issue in great detail. It turns out that the World Wide Web Consortium (www.w3c.org) has developed a time ontology in OWL for anyone to use. It's rather simple and elegant and has useful application for our Study Ontology. Linking our study ontology with the W3C time ontology is a nice example of the benefits of Linked Data. The ontology goes like this....

A temporal entity can be either a time:Instant (a single point in time) or a time:Interval (has a duration). Intervals have properties like time:hasBeginning and time:hasEnd. These are not totally disjoint because one can consider an Instant as an Interval where the start and end Instants are the same, but this is a minor point.

For many Activities, such as a blood test or a vital signs measurement, all we really care about is the date/time it occurred. For all practical purposes, that is a time:Instant. Some activities do have a duration worth knowing about, so one can attach a time:Interval to them. The nice thing about an Interval is that it links the beginning instant to the end instant ... they go together. The time:Interval resource is the link that holds them together.

So let's look at some examples taken from the SDTM of some important Intervals and how they might look in the RDF when we link to the W3C time ontology. As always, I use Turtle syntax as it's very human-readable:

study:ReferenceStudyInterval rdf:type time:Interval;
     time:hasBeginning sdtm:RFSTDTC ;
     time:hasEnd sdtm:RFENDTC .

study:ReferenceExposureInterval rdf:type time:Interval;
     time:hasBeginning sdtm:RFXSTDTC ;
     time:hasEnd sdtm:RFXENDTC .

and another important one:

study:Lifespan rdf:type time:Interval;
     time:hasBeginning sdtm:BRTHDTC ;
     time:hasEnd sdtm:DTHDTC .

Now here is where it gets fun. Let's say you want to derive RFXSTDTC and RFXENDTC (first and last day of exposure). Imagine your database has various time:Interval triples for each subject, each describing a fixed dose interval. Imagine in this example, Person1 participates in 3 fixed dosing intervals, as shown in the RDF as follows:

study:Person1 study:participatesIn study:DrugAdministration1, study:DrugAdministration2,

Each administration is associated with an interval: Interval1, Interval2, Interval3, each of which has a time:hasBeginning and a time:hasEnd date. One can write a SPARQL query to pull out the minimum (earliest) time:hasBeginning date and the maximum (latest) time:hasEnd date across all the drug administration intervals and thereby automatically derive the two SDTM dates of interest. The same can be done for RFPENDTC (reference participation end date). I can't tell you how often this date is wrong in actual study data submissions. A SPARQL query can identify all dates for all study activities associated with a Subject and pick out the maximum date, which is the RFPENDTC. Best of all, these standard queries can exist as a resource on the web using SPIN (SPARQL Inferencing Notation) for anyone to use.
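The min/max logic behind that derivation can be sketched in a few lines. What follows is a plain-Python stand-in, not the SPARQL or SPIN query itself. It assumes the intervals have already been retrieved from the triple store as (beginning, end) pairs, and the dates and the function name `derive_exposure_dates` are invented for illustration.

```python
from datetime import date

# Hypothetical dosing intervals for study:Person1, one (hasBeginning, hasEnd)
# pair per drug administration. All dates are invented.
dosing_intervals = [
    (date(2017, 1, 10), date(2017, 1, 24)),
    (date(2017, 1, 25), date(2017, 2, 7)),
    (date(2017, 2, 8), date(2017, 2, 21)),
]


def derive_exposure_dates(intervals):
    """Return (RFXSTDTC, RFXENDTC) as ISO 8601 date strings."""
    if not intervals:
        raise ValueError("subject has no dosing intervals")
    first = min(begin for begin, _ in intervals)  # earliest beginning
    last = max(end for _, end in intervals)       # latest end
    return first.isoformat(), last.isoformat()


rfxstdtc, rfxendtc = derive_exposure_dates(dosing_intervals)
print(rfxstdtc, rfxendtc)
```

The same pattern, applied to the dates of every activity a subject participates in rather than only drug administrations, would give the maximum date used for RFPENDTC.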

But first, you need study data in the RDF and a Study Ontology.


What's in a Name?

Standardizing clinical trial data is all about automation. Standard data enable automated processes that bring efficiency and reduce human error. But automating a process, for example, an analysis of a lab test across multiple subjects in a trial, requires computers and information systems to be able to unambiguously identify that lab test. This is called computable semantic interoperability (CSI). The key is "computable." It's not enough that a human can identify the lab test of interest; computers need to do the same. I previously wrote about the interoperability problem and I revisit it here today, focusing on test names.

There are two situations that impede CSI: [1] when the same Thing goes by two different names, or even more troublesome [2] when two different Things go by the same name. When I say Mustang do I mean the car, or the horse? Some describe the term Mustang as "overloaded" because it can represent more than one Thing. Issue #1 is addressed by controlled terminology. Synonyms can then be mapped to a controlled term that all agree to use. Issue #2 is more challenging, but it is avoidable by assigning different names to different things. I consider this a best practice to promote CSI.

As an example, let's look at the CDISC controlled term "glucose" (code C105585). The definition is "a measurement of the glucose in a biological specimen." The reality is that a serum glucose and a urine glucose are two completely different tests, having different clinical meaning and interpretation. I have been advocating for more granular lab test names for a long time so that computers can easily distinguish different tests. The counter-argument is that serum glucose is really two concepts: the specimen and the "thing" being measured (known as the component, or analyte in LOINC), and therefore should be represented as two different variables. In fact, the SDTM does have a separate field for specimen information (LBSPEC), and don't get me wrong, there is value in separate specimen information, but that doesn't diminish the need for different test names. The problem is, one has to tell or program a computer "if test=glucose, look at specimen information to pick out the correct glucose test." But what about another observation, say "Occurrence Indicator" (an FATEST as described in the Malaria Therapeutic Area User's Guide)? One must know to look at another field (FAOBJ) to understand that the occurrence is a fever, or a chill. Where to look for that additional data is not always obvious and varies by test. In the Malaria example, we have two different occurrences and they should each have their own name: Fever Indicator, Chills Indicator.

There are two problems with relying on other data fields to disambiguate an overloaded concept: [1] keeping track of which field disambiguates which test is onerous, and [2] new lab tests are being added all the time. (By the way, LOINC avoids this problem by assigning different codes to different tests and providing separate data fields for analyte, source, method, etc.)

This problem became clear to me when a colleague at FDA, who was using an automated analysis tool to analyze serum glucose levels among thousands of patients, was getting funny results. After quite some digging, she realized the tool was pooling serum and urine glucoses. She and I knew to look at LBSPEC. The tool, however, wasn't smart enough to do so. I wonder how many other analyses of other tests have this problem and go unrecognized.
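The pooling problem is easy to reproduce in miniature. The sketch below uses invented records and a hypothetical helper, `mean_by`; it is not the tool my colleague used. It simply shows how grouping on the test name alone mixes serum and urine results, while grouping on name plus specimen keeps them apart.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lab records: (test name, specimen, result in mg/dL).
# Values are invented for illustration.
records = [
    ("Glucose", "SERUM", 90.0),
    ("Glucose", "SERUM", 110.0),
    ("Glucose", "URINE", 0.0),
    ("Glucose", "URINE", 20.0),
]


def mean_by(records, key):
    """Average the result values after grouping records with `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec[2])
    return {k: mean(v) for k, v in groups.items()}


# What a naive tool does: group on the overloaded test name only.
pooled = mean_by(records, key=lambda r: r[0])

# What a human reviewer knows to do: also look at the specimen (LBSPEC).
by_specimen = mean_by(records, key=lambda r: (r[0], r[1]))

print(pooled)
print(by_specimen)
```

With distinct names such as "Serum Glucose" and "Urine Glucose", the first, naive grouping would already give the right answer and no test-specific rule would be needed.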

So, in the interest of promoting true computable semantic interoperability without burdening data recipients with unnecessary algorithms to disambiguate overloaded terms, please remember to name different things differently. It can be that simple.


SDTM Data in RDF: Activities in Clinical Trials

PhUSE has approved a new project to evaluate and demonstrate the potential value of using RDF for SDTM data. It's called SDTM Data as RDF. The project is headed by Tim Williams and me, and it will kick off at the upcoming PhUSE Computational Science Symposium just outside Washington DC. As many of you know, I'm an advocate for using the RDF for study data. One of the goals of this new project is to develop a simple Study Ontology that, when combined with study data in RDF, can be used to generate high quality, highly standardized and valid SDTM datasets. If successful, it will address a major ongoing problem: high variability in SDTM implementation across studies and applications.

To achieve that goal, we will develop a simple study ontology using OWL that will support SDTM dataset creation using standard SPARQL queries. It will leverage existing BRIDG classes as needed. We are starting with two domains: DM and VS and, if successful, the ontology will be extended to support other SDTM domains as well as non-standard data that currently wind up in SUPPQUAL or custom SDTM domains. If successful, the project outcome can provide a compelling reason to use RDF for study data today to solve a major SDTM implementation challenge that sponsors currently face. As the project progresses, I plan to discuss modeling challenges and how RDF/OWL can address them. Today I discuss Activities in clinical trials.   

A clinical trial is, at its most fundamental, a collection of activities and the rules that describe when those activities are performed. There are also rules that describe how those activities are grouped (e.g. into arms, visits, epochs, etc.) to facilitate study conduct and analysis.

Our mini Study Ontology divides study Activities into these subClasses:
  1. Observations -- symptoms, signs, tests, etc. that measure the physical, mental, or physiological state of a subject
  2. Analyses -- activities that take as input one or more Observations and generate analysis results
  3. Interventions -- activities that are performed on a subject with the usual intent of modifying or identifying a medical condition (e.g. drug administration, device implantation, surgery)
  4. Administrative Activities -- e.g. informed consent, randomization, etc.
The Analysis class has a couple of interesting subclasses: 
  1. Assessment - this activity analyzes Observations and their results to identify (e.g. diagnose) and/or characterize (e.g. severity assessment) a Medical Condition.
  2. Rule - this activity analyzes Observations and their results to determine the start of another Activity. This includes eligibility criteria, which take screening data and determine whether the subject advances to the next Activity, usually randomization. It also includes more generic start rules such as "take study medication within two hours of headache onset." 
All Activities have outcomes, so there is a class ActivityOutcome. For Observations, the outcome is the result. For Assessments, the outcome is the identification and characterization of a Medical Condition. Many tests in medicine combine Observation outcomes with Assessment outcomes. For example, consider a CT scan of the head. The Observation might be a 2 cm lesion in the right frontal lobe that enhances with contrast, and has mass effect (obliteration of the sulci and a shift of midline structures). The Assessment, performed by a trained Neuroradiologist, establishes the presence of a cerebral tumor. Additional observations and assessments are then needed to confirm the diagnosis and further characterize the tumor (e.g. Grade 3 Astrocytoma). 
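To make the Rule subclass a bit more concrete, here is a minimal sketch of an eligibility rule that takes screening Observations and produces an outcome: whether the subject may advance and, if not, which criteria failed. The criteria, the observation names, and the function `evaluate_eligibility` are all hypothetical. In the approach described here the rules would live in the RDF with the protocol rather than in application code.

```python
# Hypothetical screening observations for one subject (invented values).
screening = {"age_years": 17, "systolic_bp_mmHg": 128}

# Hypothetical eligibility criteria: (description, predicate on the data).
criteria = [
    ("Age 18 years or older", lambda obs: obs["age_years"] >= 18),
    ("Systolic BP below 160 mmHg", lambda obs: obs["systolic_bp_mmHg"] < 160),
]


def evaluate_eligibility(observations, criteria):
    """Return (eligible, descriptions of the criteria that failed)."""
    failed = [desc for desc, test in criteria if not test(observations)]
    return (not failed, failed)


eligible, reasons = evaluate_eligibility(screening, criteria)
print(eligible, reasons)
```

Returning the failed criteria alongside the yes/no answer is what lets a tool report why a subject failed screening, not only that they did.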

The MedicalCondition class contains all the medical conditions that afflict the Subject (including past conditions). By medical condition I mean a disease (e.g. Epilepsy) or a disorder (e.g. Seizure) or a transient physiologic state that benefits from medical Interventions (e.g. Pregnancy). There are two main subclasses: Indication (the reason an Intervention is performed, such as a Drug Administration), and AdverseEvent (a medical condition that begins or worsens after an Intervention is performed). 

So the "core" study ontology looks like this (class view only):

     -- Person

You can imagine a host of properties/predicates that link these together. HumanStudySubject participatesIn Activity and Activity hasOutcome ActivityOutcome are just two.
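The subclass relationships described in the prose above can be written down and queried in a few lines. This is only a plain-Python illustration of the idea, not OWL and not the project's ontology file. The singular class names (for example "AdministrativeActivity") are assumed spellings, and the helper `is_a` is hypothetical.

```python
# Subclass links taken from the prose description of the ontology.
# Class name spellings are assumptions made for this sketch.
SUBCLASS_OF = {
    "Observation": "Activity",
    "Analysis": "Activity",
    "Intervention": "Activity",
    "AdministrativeActivity": "Activity",
    "Assessment": "Analysis",
    "Rule": "Analysis",
    "Indication": "MedicalCondition",
    "AdverseEvent": "MedicalCondition",
}


def is_a(cls, ancestor):
    """True if cls is ancestor or a direct or indirect subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)  # climb one level; None at the top
    return False


print(is_a("Rule", "Activity"))          # True: Rule -> Analysis -> Activity
print(is_a("AdverseEvent", "Activity"))  # False: it is a MedicalCondition
```

An OWL reasoner draws the same kind of conclusion from rdfs:subClassOf statements, which is why a query for all Activities would also pick up eligibility Rules.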

As the PhUSE project advances, we will be testing to see if all study activities will fit in this model. In the meantime, I welcome your comments here but please consider getting involved in the project. We can guarantee all of us will learn a lot and maybe find a better way of implementing the SDTM. And finally, come to the PhUSE CSS if you can! I hope to see you there.