Using the RDF to Generate (and Validate) an SDTM Demographics Domain

In a previous post, I discussed how the Resource Description Framework (RDF) could be used to improve the management of biomedical knowledge. That discussion was theoretical, so it was difficult to appreciate how RDF could provide immediate value. RDF, or any other technology for that matter, will not be adopted or implemented unless it can solve real problems, especially in the short term.

Here I discuss a simple, practical application of the RDF to demonstrate how it can solve a real problem now. In another prior post, I described the ideal role of the SDTM as that of a standard report, generated from a database, that is used for analysis. This example automates the creation of an SDTM demographics domain from an RDF database (called a triple store). First, I create a simple ontology of a study and use it to generate sample study data in RDF. I then store the ontology and the data in a simple RDF triple store database (a "knowledgebase"), and finally I use SPARQL (the RDF query language) to query the database and generate an SDTM demographics domain. I also discuss how this strategy can be used to validate the data. I used a commercial, off-the-shelf (COTS) product: TopBraid Composer. The RDF file used for this exercise is available for download in Turtle format.

First, I created a mini study ontology containing only the classes and properties needed for this small exercise. You'll recognize many of the classes from BRIDG and the SDTM. I added a new class called SDTM_Domain, which will contain a resource for each instance of an SDTM domain/dataset.

Mini Study Ontology - Classes
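In Turtle, the class portion of such a mini ontology might look like the sketch below (the namespace URI is an assumption; the class names follow the post and diagrams):

```turtle
@prefix :     <http://example.org/study#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Study         a owl:Class .
:Study_Site    a owl:Class .
:Study_Subject a owl:Class .
:Country       a owl:Class .
:Sex           a owl:Class .
:SDTM_Domain   a owl:Class ;
    rdfs:comment "One resource per SDTM domain/dataset, e.g. :demographics" .
```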

I then created the properties. First come the object properties (in blue), which relate two classes to each other; then the datatype properties (in green), which describe the data:

Mini Study Ontology - Properties

For example, the :conductedAt property relates the Study_Site class to the Country class. It enables asserting, for example, that Site 0001 was conducted in Germany. These relationships are captured as Domain and Range information for each property using the standard rdfs:domain and rdfs:range properties. Another example is the :age property, which has the :Subject class as its domain and the datatype xsd:integer as its range.
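In Turtle, these two property declarations might look like the following sketch (the namespace URI is an assumption; the domain/range pairs follow the post):

```turtle
@prefix :     <http://example.org/study#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Object property: relates a study site to the country it is conducted in
:conductedAt a owl:ObjectProperty ;
    rdfs:domain :Study_Site ;
    rdfs:range  :Country .

# Datatype property: a subject's age must be an integer
:age a owl:DatatypeProperty ;
    rdfs:domain :Study_Subject ;
    rdfs:range  xsd:integer .
```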

Using this ontology, I populated the triple store with dummy instance data for 10 subjects (you'll see the number 10 next to :Study_Subject in the diagram above, indicating the database has 10 study subject instances). Similarly, I entered 4 instances of study sites, 2 instances of Sex (male, female), etc.
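Instance data for one subject and one site might look like this sketch (all identifiers and values are dummy assumptions, and :enrolledAt is a hypothetical property, not necessarily the one used in the actual file):

```turtle
@prefix : <http://example.org/study#> .

:Subject_01 a :Study_Subject ;
    :enrolledAt :Site_0001 ;
    :sex        :Male ;
    :age        34 .          # plain integer literal, i.e. xsd:integer

:Site_0001 a :Study_Site ;
    :conductedAt :Germany .
```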

Finally, I created a single instance of an SDTM demographics domain, calling it :demographics:
Demographics domain Resource
As a property of this resource, I selected a standard property, spin:query, from the SPIN (SPARQL Inferencing Notation) ontology to embed the SPARQL query in a triple in the database. My understanding is that the instructions on how to generate the DM dataset (written in SPARQL) now become part of the knowledgebase; I need experts to confirm this is correct. Here is what the SPIN query looks like.

SPARQL Query to generate DM Domain
So when I run the query, I get the DM domain of the study as a report out of the knowledgebase.
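The query itself appears in the figure above; as a rough sketch of its general shape, a SELECT of the following form could produce the tabular DM view (the variable names follow SDTM column names, but the property names and namespace are assumptions, not the exact query from the post):

```sparql
PREFIX :     <http://example.org/study#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?USUBJID ?SITEID ?COUNTRY ?SEX ?AGE
WHERE {
  ?subj a :Study_Subject ;
        :usubjid ?USUBJID ;
        :enrolledAt ?site ;
        :sex/rdfs:label ?SEX ;     # property path: follow :sex, then its label
        :age ?AGE .
  ?site :siteId ?SITEID ;
        :conductedAt/rdfs:label ?COUNTRY .
}
ORDER BY ?USUBJID
```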

DM Domain generated from RDF data
The fact that I was able to do this with very limited technical experience speaks, I think, to the power and simplicity of this technology. Another benefit is that the knowledgebase, while perfectly functional for generating one domain now, can easily be expanded iteratively over time to generate all domains from RDF study data. Another big advantage is that validation rules in RDF can be added to the knowledgebase, enabling a reasoner to identify validation errors. In fact, the existing triple in this knowledgebase,

             :age rdfs:range xsd:integer.

which says that the permissible values for age are integers, already represents an executable validation rule in the knowledgebase. If one enters a non-integer value for age, a reasoner can identify and surface the contradiction, which is essentially a validation report. More complex validation rules can be constructed in RDF, for example for positive integers, cardinality constraints, value set constraints, etc.
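As a sketch of how such a check could be executed, a plain SPARQL query can also surface range violations directly, without waiting for a full reasoner (the namespace URI is an assumption):

```sparql
PREFIX :    <http://example.org/study#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Report every :age value that is not a literal typed as xsd:integer
SELECT ?subj ?age
WHERE {
  ?subj :age ?age .
  FILTER ( !isLiteral(?age) || datatype(?age) != xsd:integer )
}
```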

As a next step, it would be useful to redo this example using the publicly available CDISC Foundational Standards in RDF specification. I was going to do this but haven't gotten around to it.

Longer term, these datasets should be "generate-able" from a BRIDG ontology. I believe an OWL representation of BRIDG will pave the way to generating even more useful reports for analysis (see another post: BRIDG as a Computable Ontology). One could then populate the knowledgebase with, or incorporate by reference, more and more validation rules, and even FDA policy statements expressed in RDF, automating the ability not only to detect invalid data but also to flag data that does not conform to FDA study data submission policies as described in guidances and regulations.

In summary, the RDF provides the capability to implement practical solutions now, offering an alternate mechanism for automatically generating SDTM datasets using simple COTS tools, while at the same time providing the flexibility to expand the capabilities of the knowledgebase to support more and more solutions, such as:
  1. Generate all SDTM datasets 
  2. Validate the data 
  3. Determine conformance with FDA submission policies
  4. Generate more useful views/reports for analysis

Once this capability in the knowledgebase is fully developed, one would no longer need to exchange the tabular reports but could exchange the RDF data themselves. Or better yet, given the distributed or "linked data" capabilities of the semantic web, the recipient could simply be granted access to the RDF data on the web.


  1. Armando, this is excellent! Taken in conjunction with your post on terminology mapping, one can truly begin to see the power of RDF for data integration. Consider the following simple example: I also have some old, legacy data in RDF. I determine that my TreatmentAssignment class is the “sameAs” your Arm class and that some values of my TreatmentAssignment class are the “sameAs” values of your Arm class. I can use the owl:sameAs property to link our ontologies and start integrating our disparate data sources. Again, a VERY simple example, but consider the amount of time and resources those working in the clinical trial space spend on data integration. We assess our disparate data sources - legacy data, data in different versions of industry standards, etc. We document what we have. We document (create specifications) how to harmonize these different sources into one representation. We write code to harmonize the data. We validate code. It’s not a trivial task. RDF is not a push-button solution, but it can definitely make this arduous task easier.

  2. I agree with Scott: a brilliant analysis, and the conclusion at the end is especially great! I have had similar ideas myself. When reading the "CDISC Standards RDF Reference Guide" I started to think that this Guide could be developed further into an OWL schema for the actual data, linking all the CDISC standards together into one.

    Providing the entire RDF dataset to the authorities instead of different (similar) tabular formats for different purposes would simplify the life for everyone leaving it up to the receiver to extract the data using Stylesheets or SPARQL queries.

    If we then also could get MedDRA and WHO Drug (officially) as linked data sources life would be made even more simple!

  3. Scott, Magnus, thank you for your comments. I think it would be useful to do a small demonstration project to show how these linked ontologies (CDISC, WHO-DDE, MedDRA) can be used to create a knowledge base from which tabular data for submission and analysis can be derived. My thought is that the ontologies need not be complete at first, but complete enough to support a specific, narrow use case, and can then be expanded gradually over time.
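Scott's mapping from the first comment could be asserted with just a few triples, sketched below (the legacy: namespace and all names are assumptions; note that owl:equivalentClass is the class-level analogue of owl:sameAs, which strictly speaking links individuals):

```turtle
@prefix :       <http://example.org/study#> .
@prefix legacy: <http://example.org/legacy#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .

# Link the class-level concepts
legacy:TreatmentAssignment owl:equivalentClass :Arm .

# Link individual values of the two classes
legacy:TRT_A owl:sameAs :Arm_A .
```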