2016-06-01

Five Reasons to use the RDF for Study Data

The currently supported file format to exchange clinical trial data with the FDA is the SAS Transport version 5. It is generally well recognized that this format is obsolete and should be replaced with a more modern format. There is one effort underway to identify a new format. This post makes a case for considering the Resource Description Framework (RDF) for representing and exchanging study data. I have been learning quite a bit about the RDF for the past several years. Although I am not an expert, I've been able to appreciate the benefits RDF offers over other potential solutions.

The RDF provides a general approach for describing data of any kind. It is the foundation of the "semantic web." Multiple serialization formats exist for the RDF, including RDF/XML, turtle, and JSON-LD. Numerous freely available converters exist on the Web. Just do a Google search for "RDF format converter."

When I say RDF, I actually mean a suite of related standards that "play nicely together" and exist to ensure the meaning or semantics exist and travel with the data. These include the RDF, OWL (Web Ontology Language), SPARQL (the query language of the semantic web), and SPIN (SPARQL inference notation).

So why use the RDF? There are several considerations to keep in mind. First, I'd like to point out the Yosemite Project. It is an effort to improve semantic interoperability of all structured healthcare information by using the RDF. It provides numerous compelling arguments in favor of the RDF. Then there is a lot of activity exploring the use of the RDF in the biomedical information space. For example, there is existing work in expressing CDISC standards in the RDF. There is an ongoing effort to develop an RDF representation of HL7 FHIR. There are numerous other efforts to express other standards, such as MedDRA, SNOMED CT and other terminologies using the RDF.

It is worth mentioning that it is not absolutely necessary that data be exchanged using the RDF. Other standards may be acceptable as long as they have an unambiguous representation in the RDF and can be easily converted to/from RDF. This helps ensure a level of semantic clarity that is often lacking in existing standards due to unrecognized ambiguous definitions or naming of concepts. In a previous post, I describe best practices for defining concepts in a way that facilitate an unambiguous RDF/OWL representation. Converting to an RDF representation (particularly OWL), in my experience, tends to surface semantic ambiguities that previously may have gone unrecognized and impede semantically interoperable data exchange.

Here are the five reasons.

Reason #1. As the industry considers retiring the current exchange format, any new format should be capable of accommodating new content requirements easily while maintaining backwards compatibility as much as possible. Switching to a new format is going to be costly, so it's important to pick one that is useful over the long haul. Tools must be modified to be able to read and write data using the new format. The standard must be robust to accommodate current requirements and flexible enough to accommodate new requirements easily as they emerge, as change is inevitable. The RDF is capable of supporting this. Ontologies and vocabularies may need updating, but the underlying tools that interact with the updated ontologies don't change, because the underlying standard, the RDF, is the same. This is extremely powerful. Incorporating new content requirements has turned out to be a major roadblock in implementing existing standards. I would argue that this reason alone is the major reason for testing and transitioning to the RDF for study data exchange.

Reason #2. Standardizing study data often requires mapping data that exist in legacy formats. The RDF provides powerful mapping capabilities, using standard predicates (i.e. relationships between variables and data points) such as owl:equivalentClass and owl:sameAs. One can also leverage other existing ontologies such as SKOS (Simple Knowledge Organization System). By entering a single RDF triple in a database, one can assert an existing variable as equivalent to standard variable and the mapping is essentially complete from a computational standpoint.

Reason #3. Standardizing study data requires implementing not just a single standard, but a suite of exchange, terminology, and analysis standards. One needs to link all these standards together in a web of data and metadata. For example, a variable from Exchange Standard A has permissible terms from Terminology B. All of these are currently documented using various formats across various sources. Integrating all the necessary standards for implementation is very labor intensive, particularly as each standard evolves and different versions emerge. All the necessary standards can be expressed in the RDF, which is uniquely designed to link data across multiple sources.  One can link multiple standards in one single ontology, which is published and maintained on the web as a resource. This would greatly simplify implementation and maintenance.

Reason #4. The RDF (and related standards) can be used to represent not only the study data, but the metadata, including definitions for derived data, the validation rules, and even the analysis plan. Any kind of information can be represented using RDF and this enables computers to link everything together for easy access and analysis. FDA guidances, regulations, and statutes can be expressed in the RDF. Why is this useful? If study data are expressed in the RDF, one can link to the relevant guidance recommendations expressed in the RDF and computers can help determine if the data are guidance-compliant. For example, an FDA guidance could have a series of RDF triples describing the recommended properties of a nonclinical skin toxicity study. A computer can then compare those recommendations with the actual properties of a study and identify any discrepancies. To my knowledge, no other existing standard is capable of supporting this important use case, which is currently a very time consuming, manual process. It is often very costly when discrepancies are identified too late, after studies are well underway or complete.

Reason #5. Another advantage is that RDF data can reside on the web and be easily accessible, easily queried, and easily combined with other RDF data. Assuming security concerns are appropriately addressed (i.e. proprietary data must reside behind firewalls and accessible only to authorized entities), this supports a future state where data are no longer exchanged, but users are simply provided access to the data on the web. RDF and related standards already support this distributed nature of managing and querying data from multiple sources.

Many of these benefits will take time to realize, but I think it's important to take a long-term perspective. I admit my analysis comes from a novice who has only begun to scratch the surface of the RDF and what it offers. I welcome thoughts from true RDF experts. I strongly encourage those not that familiar with the RDF to learn more about it. I have personally found this resource to be quite useful. There are clearly major technical challenges to overcome, but in my humble opinion, RDF represents the most promising alternative for the future of study data exchange. It would enhance computable semantic interoperability and open a new era in managing and integrating data from multiple sources, which in my view is the key for the next major phase of medical discoveries.