The Future of Study Data Exchange

The CDISC Study Data Tabulation Model (SDTM) is the FDA-supported exchange standard for subject-level clinical trial tabulation data. I have been working with the SDTM since it was first developed by CDISC, and I think I know it pretty well.

It has served the Agency well over the years, but its limitations are becoming increasingly problematic. No data standard is perfect, but one of the biggest problems with the SDTM is that it is used as both an exchange standard and an analysis standard. As I elaborate below, the requirements for the two use cases are different and often competing. The result is that the SDTM is pulled in two opposite directions and cannot do both optimally.

What do I mean by an exchange standard? A reasonable working definition is a standard way of exchanging data between computer systems. An exchange standard typically describes the standard data elements and relationships necessary to achieve the unambiguous exchange of information between different information systems.

Exchange standards exist to support interoperability. HIMSS (the Healthcare Information and Management Systems Society) defines interoperability as:

The ability of different information technology systems and software applications to communicate, exchange data, and use the information that has been exchanged.

An analysis standard describes a standard presentation or view of the data to support analysis. It includes extraction, transformation, and derivations of the exchanged (i.e. submitted) data.

A file format (e.g. SAS transport file, XML, MS Word, PDF) is not an exchange standard. Data based on an exchange standard need to be serialized using a file format, e.g. SDTM + XPT, Structured Product Labeling (SPL) + XML, or Consolidated CDA + XML.
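To make the distinction concrete, here is a minimal Python sketch in which the same content is serialized in two different file formats. The records and their values are hypothetical, and JSON and CSV merely stand in for real serializations such as XPT or XML; only the variable names are borrowed from SDTM.

```python
import csv
import io
import json

# Hypothetical content: two adverse event records with SDTM-style variable names.
records = [
    {"STUDYID": "ABC-001", "USUBJID": "ABC-001-0001", "AETERM": "HEADACHE", "AESTDTC": "2014-03-02"},
    {"STUDYID": "ABC-001", "USUBJID": "ABC-001-0002", "AETERM": "NAUSEA", "AESTDTC": "2014-03-05"},
]


def to_json(rows):
    """Serialize the records as JSON text."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the same records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_json(text):
    """Read the records back from JSON text."""
    return json.loads(text)


def from_csv(text):
    """Read the records back from CSV text."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


json_text = to_json(records)
csv_text = to_csv(records)

# The two files look nothing alike, yet both round-trip to the same content.
print(json_text != csv_text)
print(from_json(json_text) == records and from_csv(csv_text) == records)
```

The data elements and their meaning come from the standard; the file format only determines how they are written to disk.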

A good exchange standard promotes machine readability and process automation at the expense of end user (human) readability. The exchanged data are then transformed to make them user friendly. A good analysis standard promotes human readability/usability and use of the data using analysis tools.

A useful analogy is to consider data as pieces of furniture. The exchange standard describes the standard container to move the furniture between two places. The size and shape of the container are dictated by what one is moving: the contents. The container is designed for efficiency in moving furniture and to avoid damage or loss. Everything you need is there, but unpacking is necessary for the furniture to be useful. The analysis standard describes the arrangement of the furnishings in the house. One may have to assemble certain pieces of furniture before they can be used. The arrangement maximizes the use of the furnishings (e.g. all cooking appliances go in the kitchen).

Here is a high level comparison of the requirements for each type of standard. One can see the differences. The challenges of meeting both are clear.


Exchange Standard | Analysis Standard
--- | ---
No duplication of data/records (minimize errors, avoid inconsistencies, minimize file size) | Duplicate data (e.g. demographics and treatment assignment in each dataset)
No derived data (minimize errors, avoid data inconsistencies, promote data integrity) | Derived data (facilitate analysis)
Very standard structure (highly normalized, facilitate data validation and storage) | Variable structure to fit the analysis
Numeric codes for coded terms (machine readable, facilitate data validation) | No numeric codes (human understandable)
Multi-dimensional (to facilitate creation of relational databases and reports of interest) | Two-dimensional (tabular) containing only relationships of interest for specific analysis (easier for analysis tools)
Standard character date fields (ISO 8601; ensures all information systems can understand them, i.e. semantic interoperability) | Numeric date fields (facilitates date calculations, intervals, time-to-event analyses)
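The comparison above can be illustrated with a small Python sketch that turns exchange-style records into an analysis-style table. Everything here is an assumption made for illustration: the records, the severity code list, and the AESEVCD variable are hypothetical; the numeric date is counted in days since 1 January 1960, as SAS does; and the study day follows the common convention that the reference date is day 1 and there is no day 0.

```python
from datetime import date

# Hypothetical code list: numeric codes travel in the exchanged data.
SEVERITY_CODES = {1: "MILD", 2: "MODERATE", 3: "SEVERE"}

# Exchange-style records: nothing duplicated, nothing derived, ISO 8601 dates.
demographics = {"S-0001": {"SEX": "F", "ARM": "DRUG X", "RFSTDTC": "2014-03-01"}}
adverse_events = [
    {"USUBJID": "S-0001", "AETERM": "HEADACHE", "AESEVCD": 2, "AESTDTC": "2014-03-04"},
    {"USUBJID": "S-0001", "AETERM": "NAUSEA", "AESEVCD": 1, "AESTDTC": "2014-02-27"},
]

SAS_EPOCH = date(1960, 1, 1)  # day 0 for SAS numeric dates


def study_day(event_iso, reference_iso):
    """Derive a study day: the reference date is day 1 and there is no day 0."""
    delta = (date.fromisoformat(event_iso) - date.fromisoformat(reference_iso)).days
    return delta + 1 if delta >= 0 else delta


def to_analysis_view(events, dm):
    """Build a flat, analysis-friendly table from the exchange-style records."""
    rows = []
    for ae in events:
        subject = dm[ae["USUBJID"]]
        start = date.fromisoformat(ae["AESTDTC"])
        rows.append({
            "USUBJID": ae["USUBJID"],
            "SEX": subject["SEX"],                      # duplicated on every row
            "ARM": subject["ARM"],                      # duplicated on every row
            "AETERM": ae["AETERM"],
            "AESEV": SEVERITY_CODES[ae["AESEVCD"]],     # code replaced by readable text
            "AESTDT": (start - SAS_EPOCH).days,         # numeric date
            "AESTDY": study_day(ae["AESTDTC"], subject["RFSTDTC"]),  # derived value
        })
    return rows


analysis_rows = to_analysis_view(adverse_events, demographics)
for row in analysis_rows:
    print(row)
```

Every step in `to_analysis_view` adds something the exchange column forbids (duplication, derivation, decoding), which is why the same dataset struggles to serve both purposes.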

As an exchange standard, SDTM is used to exchange data between the applicant and FDA. In CDER, it is also used to load study data in the Janus Clinical Trials Repository (CTR). As an analysis standard, SDTM is used to perform standard basic analyses of the data (demographics, adverse events, etc.) and to explore the data. (CDER implements certain standard analyses as part of the CTR environment.)

Because of the differing and competing requirements of data exchange and analysis, SDTM is being pulled in two directions and cannot perform either role optimally. We must choose which one is more important. (Already the FDA recognizes that standard SDTM datasets do not support standard analyses optimally and has developed modifications or “enhanced” SDTM views to improve support for the analysis use case.)

I believe there is an increasing view that for data exchange we need to move to a more modern, relational exchange standard for clinical trial data, one that is based on a more robust information model, e.g. BRIDG+XML, FHIR, or semantic web standards like RDF/OWL. (Information modeling is out of scope for today’s post, but I plan to discuss this topic in the future.) As one small example, it is unnecessary for STUDYID to be repeated thousands of times in a study data submission, once for each record in a domain. The STUDYID should be provided once and referenced as needed. A more significant example is that SDTM lacks relationships between planned and performed observations, making it challenging to analyze the data for protocol compliance and violations. (Note: SDTM provides a work-around, the DV (Protocol Deviations) domain, but this is not ideal.)
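As a rough sketch of what those two examples could look like in a relational model, consider the following Python. It is not BRIDG, FHIR, or any existing standard; the class names, fields, and data are all hypothetical. The study identifier is stated once for the whole submission, and each performed observation carries an explicit link to the planned observation it fulfils.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlannedObservation:
    plan_id: str
    test: str
    planned_day: int


@dataclass(frozen=True)
class PerformedObservation:
    subject_id: str
    test: str
    actual_day: int
    plan_id: Optional[str] = None  # link back to the protocol; None if unplanned


@dataclass
class Submission:
    study_id: str  # provided once, not repeated on every record
    planned: List[PlannedObservation]
    performed: List[PerformedObservation]

    def missed(self, subject_id):
        """Planned observations with no matching performed observation."""
        done = {obs.plan_id for obs in self.performed if obs.subject_id == subject_id}
        return [plan for plan in self.planned if plan.plan_id not in done]

    def unplanned(self):
        """Performed observations that the protocol did not call for."""
        return [obs for obs in self.performed if obs.plan_id is None]


submission = Submission(
    study_id="ABC-001",
    planned=[
        PlannedObservation("P1", "ECG", 1),
        PlannedObservation("P2", "ECG", 15),
    ],
    performed=[
        PerformedObservation("S-0001", "ECG", 1, "P1"),
        PerformedObservation("S-0001", "ECG", 9),
    ],
)

print([plan.plan_id for plan in submission.missed("S-0001")])
print([(obs.test, obs.actual_day) for obs in submission.unplanned()])
```

With the link in place, protocol compliance questions become simple lookups rather than a separate dataset that someone has to derive and maintain.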

So what do I think is the future of the SDTM?

SDTM as an exchange standard should be retired and replaced by a “next generation” exchange standard like those described above. We need a broad conversation with multiple stakeholders to discuss what this should be. FDA can play a leadership role in that conversation, as it started to do by holding a public meeting on this topic in 2012.

SDTM as an analysis standard should remain and expand to make it more analysis friendly (i.e. follow the direction that the “enhanced SDTM” has already forged) and eventually merge with ADaM (the CDISC Analysis Data Model; there are already interesting discussions on how that might work). In the future, SDTM should be a standard report for analysis that a database (e.g. CTR) produces.

Once liberated from its role as an exchange standard, future SDTM versions can become even more useful: e.g. add core demographic and treatment variables to all domains; provide numeric dates. These are some of the changes that FDA is already implementing in its “enhanced SDTM” specification for internal use.

So for the future of SDTM, I see two options. 

Option A. This is basically what we are doing today. I call this the bicycle approach.

  • Incremental improvements to SDTM
  • Add additional variables
  • Add additional domains
  • Update implementation guide
  • Update validation rules
  • Slow adoption by sponsors & FDA
  • Inconsistent implementations
  • Redesign databases, tools as needed
  • Repeat the above for the next set of requirements

This is slow and inefficient, like riding a bicycle from Washington, DC to California. You can do it, but it is slow. There is a better way.

Option B. I call this the race car approach.

  • Adopt the next generation standard for study data exchange
  • Based on a well-structured information model: A better model that more realistically reflects clinical trial data and how they are related to each other and to the protocol; with a single standard representation for clinical observations
  • Incorporating new data and analysis requirements will be quicker, cheaper, more efficient both for sponsors and for FDA
  • Often involves just adding new controlled terms to a terminology
  • Underlying data model, implementation guides, database structure, conformance validation rules, and tools need not change (when changes are needed, they should be infrequent and minor)
  • Invest in tools to transform the data any which way
  • Generate any data view of interest from the enterprise data warehouse: e.g. enhanced-SDTM, ADaM, future analysis views X,Y,Z
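The last two points can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the two-table schema, the view names, and the data are hypothetical and far simpler than any real warehouse, and the views are not actual enhanced-SDTM or ADaM specifications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized storage: each fact is recorded once.
CREATE TABLE subject (usubjid TEXT PRIMARY KEY, sex TEXT, arm TEXT);
CREATE TABLE observation (
    usubjid    TEXT REFERENCES subject(usubjid),
    kind       TEXT,
    term       TEXT,
    start_date TEXT
);

INSERT INTO subject VALUES ('S-0001', 'F', 'DRUG X');
INSERT INTO subject VALUES ('S-0002', 'M', 'PLACEBO');
INSERT INTO observation VALUES ('S-0001', 'AE', 'HEADACHE', '2014-03-04');
INSERT INTO observation VALUES ('S-0002', 'AE', 'NAUSEA', '2014-03-06');
INSERT INTO observation VALUES ('S-0001', 'LAB', 'ALT', '2014-03-04');

-- View 1: adverse events with demographics and treatment arm on every row.
CREATE VIEW ae_enhanced AS
SELECT o.usubjid, s.sex, s.arm, o.term AS aeterm, o.start_date AS aestdtc
FROM observation AS o
JOIN subject AS s ON s.usubjid = o.usubjid
WHERE o.kind = 'AE';

-- View 2: a summary for analysis, generated from the same stored data.
CREATE VIEW ae_by_arm AS
SELECT s.arm, COUNT(*) AS n_events
FROM observation AS o
JOIN subject AS s ON s.usubjid = o.usubjid
WHERE o.kind = 'AE'
GROUP BY s.arm;
""")

enhanced = conn.execute("SELECT * FROM ae_enhanced ORDER BY usubjid").fetchall()
by_arm = conn.execute("SELECT * FROM ae_by_arm ORDER BY arm").fetchall()

for row in enhanced:
    print(row)
for row in by_arm:
    print(row)
```

Adding a new analysis view means adding another query; the stored data and the exchange format stay untouched.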

Having said this, we must recognize the tremendous amount of resources being devoted today towards implementing the SDTM as an exchange standard. Any transition strategy must take into account the natural lifecycle of systems and processes being used to support current operations. A new exchange standard needs to be introduced gradually, with a sufficient overlap period to minimize disruption. Any content standards (e.g. standard data elements, terminology) developed and implemented during Option A are readily reusable in Option B. They are not lost.

In conclusion, SDTM cannot fulfill both data exchange and data analysis roles optimally. SDTM as an exchange standard should eventually be retired. In the short term, replacing it would be too disruptive, so the exchange standard use case for SDTM must continue to be supported for now. In the long term, once liberated from its role as an exchange standard, SDTM’s future as an analysis standard is very bright. Long-term planning, including a sensible transition strategy, should move towards a next generation exchange standard for study data.


  1. Armando, it's interesting that you refer to the exchange and analysis roles, as that is what was important in your role at the FDA. There is a whole other role within the sponsor organizations, which revolves around being able to use the model operationally to get the work done on a day-to-day basis. Unfortunately, I believe SDTM falls short of fulfilling this role as well. I'll quote something you presented many years ago: the world is not flat and neither is clinical data. One of my colleagues has begun calling the data multi-dimensional, and we need a model to support that relationship. I think we can leverage the great work done within SDTM and the other standards to move it to the race car.

  2. Chris, thank you for your perspective. The operational use case is one that I had not considered. Would it be useful to document high-level requirements for this use case? Has this already been done? I believe ODM was developed to support this use case and I'm curious to hear from others how well that is working. I have no personal experience in that regard. Others have correctly pointed out that there is no benefit in moving towards a more multi-dimensional, multi-relational data model if the relationships are not being captured at the point of data collection (e.g. CRF). This is very true, but I do think that a multi-relational data model will enable the design of smarter eCRFs that can capture those relationships automatically "behind the scenes."

  3. This was recommended reading for the ODM v2 discussion. Consider Project Data Sphere. A sponsor has de-identified patient-level data and is making them available to other researchers.