Do We Need a Study Data Reviewer's Guide?

As part of a robust study data standardization program, the U.S. FDA publishes the Study Data Technical Conformance Guide. The purpose of this document is to provide "technical specifications for sponsors for the submission of animal and human study data and related information in a standardized electronic format" for investigational and marketing applications. Section 2.2 of the guide recommends the submission of a Study Data Reviewer's Guide to "describe any special considerations or directions or conformance issues that may facilitate and FDA reviewer's use of the submitted data and may help the reviewer understand the relationships between the study report and the data." Although FDA doesn't recommend the use of any specific SDRG template, it references a standard template developed by the Pharmaceutical Users Software exchange (PhUSE). 

Let's take a close look at this template. The Purpose of the document is to provide "context for tabulation datasets and terminology that benefit from additional explanation beyond the Data Definitions document (define.xml).  In addition, this document provides a summary of SDTM conformance findings." 

Here is some of the information suggested for inclusion in the SDRG

  1. Is the study ongoing? If so, describe the data cut or database status?
  2. Were SDTM used as sources for the analysis datasets?
  3. Do the submission datasets include screen failures?
  4. Were any domains planned but not submitted because no data were collected?
  5. A tabular listing of eligibility criteria that are not included in IE domain
Before we tackle the question posed at the top of this post, let's ponder a broader question: why do we need standardized data? This one is easy. Standardized data enable process efficiencies and automation. In the case of clinical trials data, reviewers are instantly familiar with the structure of the data, because it is the same across all SDTM-based study datasets. This immediate familiarity with the data structure certainly leads to review process efficiencies. But it only starts there. A common structure and common vocabularies lead to the development of standard analyses that can be automated and reused across studies.

If standardized data leads to increased familiarity with data structures then this should lead to a decrease in additional materials needed to explain the data. But we now have yet another document to explain the data that we didn't have before. The fact that a document like the SDRG is needed at all implies that there are additional data, or additional meaning behind the data, that are not captured in the datasets. 

If we had a truly semantically interoperable data exchange, there would be no need for an SDRG. The meaning behind the data would be with the data, not locked up somewhere else in a human-readable text document. In other words, the need for a Study Data Reviewers Guide represents a failure of the data standards and/or the implementation of the data standards in achieving an adequate degree of semantically interoperable data exchange.  

Sounds harsh? I believe this last statement is true. Let's look at some examples. The SDRG should describe if the study is ongoing. The data contain a study start date and study end date. If a study is ongoing, the study end date should be null. Because a null value for this variable could be due to other reasons, a separate variable (similar to the HL7 null flavor) can describe why the end date is null. Controlled vocabulary can describe the various possible reasons. This approach provides both a standard machine readable approach and human interpretable way of knowing if the study is ongoing. One could even add a 'ongoing study' flag in the trial summary (TS) domain if desired.

Here's another one: Were SDTM used as the source for analysis datasets? If one described each data point as a resource, each with a unique resource identifier (URI), then a system can easily determine where that resource came from. One could see that a data point in an analysis dataset is the same data point (i.e. resource) as what is in the SDTM. These URIs make traceability/data provenance analyses so much easier.

How about this one: Do the submission datasets include screen failures? Each subject should be linked to an administrative study activity called 'EligibilityTest" (or something similar) and the possible outcomes of which are TRUE or FALSE. A subject with EligibilityTest=TRUE means they passed screening and are eligible to continue in the study. EligibilityTest=FALSE means they failed screening. A quick scan of the data would determine if there are any subjects with EligibilityTest=FALSE indicates screen failures are present in the datasets. (Note that the rules for determining TRUE or FALSE are the eligibility criteria themselves, which have a bearing on the next example)

Another example is: the SDRG should contain a tabular listing of eligibility criteria not found in IE (inclusion/exclusion criteria dataset). All study activities should have well-described start rules. The study activity start rules for determining whether a study activity = Randomization can begin are themselves the eligibility criteria. A description of the Randomization start rule is incomplete without a listing of these rules. Their presence in the data would make it unnecessary to repeat them in an SDRG.

So what is the answer to our initial question? If data standards and their implementation were adequate, we would not need an SDRG. That fact that we need SDRGs today should be a sign our study datasets still lack important meaning that analysts need to interpret/analyze the data. It implies that more standards development and better implementation of standards are needed to increase semantically interoperable data exchange. The SDRG should eventually not be needed and should disappear. I think I'm not alone in wishing for the SDRG's eventual demise.

Please share your thoughts and ideas.