Improving the Study Data Tabulation Model

This post looks at some ways of improving the Study Data Tabulation Model (SDTM). Why? The Therapeutic Area (TA) standardization initiative has shown that new domains and variables are needed to standardize TA-specific clinical data, yet new domains and variables pose significant implementation challenges. An almost endless cycle of new domains and variables seems inevitable as more and more TAs are tackled. We need to step back and look critically at changes that would make implementation easier and more consistent across studies and sponsors. Others are expressing the need to slow down and make some corrections (see this post by W. Kubick).

SDTM in its current state is too brittle. It is not flexible enough to accommodate new clinical data requirements efficiently. I have suggested in a previous post that we need a new exchange standard, one that is based on a highly relational information model for study data. A more robust exchange standard can handle new clinical data requirements more easily, often with the simple addition of new terms to a standard terminology. However, getting there won't be easy. The suggestions here describe a possible interim solution. Some of the changes are minor. Others are quite major and likely to be controversial. I don't know if this is the right approach, but it makes sense to me as a consumer of clinical data for over 25 years. I believe the status quo is not sustainable. We need to do something.

In another post, I described the "clinical data lifecycle" that reflects how clinical data are generated and used in clinical practice, and how they are documented in a patient's medical record. I described the 4 parts of the traditional SOAP note of a clinical encounter, analogous to a subject visit in a clinical trial:
  1. Subjective Observations
  2. Objective Observations
  3. Assessment
  4. Plan 
These describe the major categories of data for a new and improved SDTM. Let's examine each in some detail. 

Clinical observations provide a snapshot in time of the physical, physiological, or psychological state of an individual. Subjective observations are those that only the individual can make (e.g. pain). Objective observations can be made by others. By themselves, observations are not attributed to any one specific cause or underlying medical condition. Many, in fact, simply measure the normal physiological state. Observations must undergo an assessment to identify and characterize one or more medical conditions that best explain the observations. The medical condition(s) form the basis for the care plan and its execution. The care plan is designed to address (e.g. treat, cure) the medical condition(s). When the care plan involves an experimental intervention as part of clinical research, then this forms the basis of a study protocol. This simple model for clinical data forms a continuous feedback loop with various stopping rules (not shown here). 

But exactly how do we leverage this model to improve the SDTM? The goal is to make SDTM more stable and flexible, i.e. to enable it to incorporate new clinical data content requirements more easily and (hopefully) greatly reduce the need for new variables and new domains. This will make implementation easier and SDTM datasets more useful. By its very nature, this is a high-level discussion that only scratches the surface of the changes that could be made. I admit that the devil is in the details.

We start with a proposal to improve cross-references. 

1. SDTM should incorporate unique identifiers for each record in each domain. 

Clinical data are richly inter-related and unique record identifiers simplify the ability to reference other relevant records. The --SEQ variable is currently used for this purpose but by itself is not unique. (Ideally, the identifier is globally unique, so that a record in one study can reference a record in another study. This brings us closer to creating a web of clinical data, which is an important component of the semantic web. But I digress.)
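
To make the idea concrete, here is a minimal sketch of a study-qualified record identifier built from variables that already exist in SDTM. The `record_id` function, the `:` delimiter, and the sample rows are assumptions made for illustration; none of them are part of the standard, and a truly global identifier would need a more careful scheme.

```python
from collections import Counter

def record_id(row: dict) -> str:
    """Build a hypothetical study-qualified identifier for one record.

    The --SEQ value alone repeats across subjects and domains, so it is
    combined with STUDYID, DOMAIN, and USUBJID.
    """
    domain = row["DOMAIN"]
    seq = row[f"{domain}SEQ"]  # e.g. LBSEQ for the LB domain
    return f'{row["STUDYID"]}:{domain}:{row["USUBJID"]}:{seq}'

def duplicate_ids(rows: list[dict]) -> list[str]:
    """Return every identifier that occurs more than once."""
    counts = Counter(record_id(r) for r in rows)
    return [rid for rid, n in counts.items() if n > 1]

# Illustrative rows only: LBSEQ is 1 for two different subjects.
rows = [
    {"STUDYID": "S1", "DOMAIN": "LB", "USUBJID": "S1-001", "LBSEQ": 1},
    {"STUDYID": "S1", "DOMAIN": "LB", "USUBJID": "S1-002", "LBSEQ": 1},
    {"STUDYID": "S1", "DOMAIN": "VS", "USUBJID": "S1-001", "VSSEQ": 1},
]

if __name__ == "__main__":
    for r in rows:
        print(record_id(r))
    print("duplicates:", duplicate_ids(rows))
```

With an identifier like this, any record can point at any other record with a single value instead of a combination of variables.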

We need to refine the definition of observations. SDTM describes observations as findings, interventions, and events, and each corresponds to an individual row in a tabulation. If we agree that a clinical observation is a measure of the physical, physiological, or psychological state of an individual, then interventions and events are not observations. Clinical events are instead medical conditions that explain the clinical observations, not the observations themselves. For example, a low serum sodium measured via a lab test (observation) may indicate, after a proper assessment, the presence of hyponatremia due to a Syndrome of Inappropriate Anti-Diuretic Hormone (SIADH) secretion (medical condition). Medical conditions are identified and characterized through assessments of clinical observations. I discuss the importance of distinguishing clinical observations and medical conditions in another post. So what about interventions? In the purest sense, all clinical observations are "interventions," since the act of observing is always associated with a process, protocol, or procedure. This process of observing "intervenes" in the subject's normal set of activities and can therefore be considered an "intervention" in the strictest sense. However, we generally limit the use of the term medical intervention to an activity whose intent is to affect in some way (e.g. treat) a medical condition. Therefore, the important information classes to support the clinical data lifecycle are similar to what we have now: observations (i.e. findings in SDTM), interventions, and medical conditions (types of events), but they are defined somewhat differently than in the current SDTM. The next needed change is:

2. SDTM should add an assessment domain to capture important assessment information. 

We've discussed that clinical observations serve as input to an assessment. Each assessment record should link to the clinical observations that serve as input, and the medical condition(s) that serve as output. Practically speaking, this is important because sometimes one needs information about that assessment to ensure the assessment is reliable, i.e. performed correctly and without bias. Who made the assessment? What are their qualifications? What observations were used in the assessment? Were the observations the appropriate ones? Are there any important observations that are missing and should have been done? When was the assessment performed? What process was followed (e.g. was there a formal adjudication process)? What criteria (e.g. diagnostic criteria, rating scale) were used? Currently there is no single, systematic way in the SDTM to capture the information associated with assessments. The end result is a patchwork of variables (many of them custom variables that wind up in SUPPQUALs) that try to fill this gap. There is a recognized need to document adjudication data, and I think this approach does that. As an aside, adverse events are adverse medical conditions that emerge or worsen following a medical intervention. These are identified by an assessment (often not documented) of relevant observations. Currently we routinely confuse an adverse event with the observations that support the presence of an adverse event. This can lead to erroneous interpretations of clinical observations.
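
One way to picture an assessment record is as a small structure that answers the questions above and carries links in both directions. The sketch below is only an illustration: the `Assessment` class, its field names, and the identifiers in the example are hypothetical and do not correspond to any existing SDTM variables.

```python
from dataclasses import dataclass, field

@dataclass
class Assessment:
    """Hypothetical assessment record linking observations to conditions."""
    assessment_id: str
    assessor: str        # who made the assessment
    assessed_on: str     # when it was performed (ISO 8601 date)
    criteria: str        # diagnostic criteria or rating scale used
    adjudicated: bool    # was a formal adjudication process followed?
    input_observation_ids: list[str] = field(default_factory=list)
    output_condition_ids: list[str] = field(default_factory=list)

def assessments_using(observation_id: str, assessments: list[Assessment]) -> list[str]:
    """Return the ids of assessments that used a given observation as input."""
    return [a.assessment_id for a in assessments
            if observation_id in a.input_observation_ids]

# Example from the text: a low serum sodium (observation) assessed as
# hyponatremia due to SIADH (medical condition). All ids are made up.
example = Assessment(
    assessment_id="A1",
    assessor="Investigator 01",
    assessed_on="2016-05-02",
    criteria="hypothetical diagnostic criteria",
    adjudicated=False,
    input_observation_ids=["LB-0001"],
    output_condition_ids=["MC-0001"],
)

if __name__ == "__main__":
    print(assessments_using("LB-0001", [example]))
```

Because the links are explicit, a reviewer can go from a medical condition back to the observations that justified it, or from an observation forward to every assessment that relied on it.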

The next suggested change is:

3. SDTM needs a single approach to describe clinical observations. 

This is a big one, because newly-identified clinical observations for therapeutic areas are triggering an ever increasing number of new variables and domains and this is not sustainable. Each observation is recorded ideally without bias (and standard processes or protocols may exist to minimize bias) and without interpretation by the observer regarding its cause. The standard components of a clinical observation are well known and they define the metadata needed for each observation. These can help guide changes to the generic findings domain in the SDTM (the model, not the I.G.). For example, is the observation planned or protocol-specified? Is there a documented process or method for making the observation (and a link to that process)? Was a device used (link to information about the device)? Was a biospecimen collected and analyzed (link to biospecimen information)? Observations are often grouped or nested, so grouping and nesting information is needed. Regarding terminology for clinical observations, the Logical Observation Identifiers Names and Codes (LOINC) should be adopted as the single terminology to describe a clinical observation. The LOINC was developed for this purpose. The lab portion of the LOINC is quite robust, but the clinical portion will likely need to grow over time to accommodate the range of clinical observations needed for clinical research. This means creating a robust process to register new clinical observations with the LOINC as new codes are needed. Using the LOINC will also help harmonize clinical data used for research with clinical data used for other purposes, since LOINC is used widely in health care. Ideally there is a single clinical observation domain containing sufficient metadata to allow subsetting in multiple ways for analysis purposes (e.g. lab, EKG, CT, MRI, etc.).

Clearly, size limitations of an observations domain quickly become an issue, but these should diminish as data are routinely loaded and stored in databases prior to use, which can then deliver observation data to the analyst in manageable chunks. In the meantime, continued splitting into multiple clinical observation domains may be necessary, but only as far as needed to meet data exchange and data management requirements.
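
The sketch below shows, under stated assumptions, how a single observation table with enough metadata could be subset or split on demand rather than being modeled as separate domains. The field names (`category`, `planned`, `loinc_code`) and the helper functions are hypothetical, and the `loinc_code` values are placeholders, not real LOINC codes.

```python
from collections import defaultdict

# One table for all observations; the metadata columns are illustrative.
observations = [
    {"obs_id": "O1", "loinc_code": "<placeholder-sodium>", "category": "LAB",
     "planned": True, "value": 128, "unit": "mmol/L"},
    {"obs_id": "O2", "loinc_code": "<placeholder-heart-rate>", "category": "EKG",
     "planned": True, "value": 72, "unit": "beats/min"},
    {"obs_id": "O3", "loinc_code": "<placeholder-sodium>", "category": "LAB",
     "planned": False, "value": 134, "unit": "mmol/L"},
]

def subset(rows: list[dict], **criteria) -> list[dict]:
    """Keep the rows whose metadata match every keyword criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]

def split_by(rows: list[dict], key: str) -> dict[str, list[dict]]:
    """Split one observation table into chunks, e.g. one per category."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return dict(groups)

if __name__ == "__main__":
    print([r["obs_id"] for r in subset(observations, category="LAB")])
    print([r["obs_id"] for r in subset(observations, category="LAB", planned=True)])
    print(sorted(split_by(observations, "category")))
```

The point of the design is that "lab" or "EKG" becomes a query over metadata rather than a structural decision, so splitting for exchange purposes can be done mechanically.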

4. SDTM should have a single approach to describe all Medical Conditions. 

By medical condition I mean a disease, injury, or disorder that interferes with well-being. A medical condition is associated with an underlying pathophysiological process. It has a beginning and an end. Medical conditions are the target of medical interventions, whereas clinical observations are not. (Pregnancy is a notable exception in that it is not a disorder but is the target of medical intervention, i.e. prenatal care, to minimize complications.) The medical conditions domain contains past and current medical conditions, including medical conditions that emerge during the study. Not all of the following information will be known about every medical condition, but there should exist a standard way of describing it if it is available. Important information about each medical condition includes: date of first symptom/sign (onset date); date of diagnosis (date the medical condition was first identified via an assessment); a link to the assessment record if one exists; a link to the clinical observations that were used for the assessment; date of resolution (end date, which for ongoing medical conditions will be null); date of reporting (cutoff date); severity at the time of the last assessment; and change in severity since the previous assessment. Is the condition ongoing at the time of reporting? Is the condition the reason for the study (the indication)? Is the medical condition considered an adverse event, i.e. any adverse medical condition that emerges or worsens following a medical intervention? Is there a plan to address/treat the medical condition (link to the plan)? In the case of the indication, the plan is documented in the protocol. The plan links to the clinical observations and medical interventions performed following the execution of the plan. From the medical conditions domain, medical history, adverse events, and clinical events domains can be derived using standard algorithms.
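
As a rough sketch of what such a derivation could look like, the code below sorts medical conditions into medical history or adverse events. The rule it applies is an assumption based only on the definition of an adverse event given above; the post does not define the actual algorithms, clinical events are left out because no rule for them is given, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class MedicalCondition:
    """Hypothetical medical condition record (field names are illustrative)."""
    condition_id: str
    term: str
    onset: date
    resolved: Optional[date] = None           # None while the condition is ongoing
    worsened_after_intervention: bool = False

def classify(cond: MedicalCondition, first_intervention: date) -> str:
    """Assumed rule, for illustration only.

    A condition that emerges on or after the first study intervention, or
    that worsens after it, is treated as an adverse event ("AE").
    Anything else is treated as medical history ("MH").
    """
    if cond.onset >= first_intervention or cond.worsened_after_intervention:
        return "AE"
    return "MH"

first_dose = date(2016, 3, 1)  # made-up date of first intervention
conditions = [
    MedicalCondition("MC-1", "Hypertension", date(2010, 6, 15)),
    MedicalCondition("MC-2", "Hyponatremia", date(2016, 3, 10)),
    MedicalCondition("MC-3", "Migraine", date(2012, 1, 5),
                     worsened_after_intervention=True),
]

if __name__ == "__main__":
    for c in conditions:
        print(c.term, classify(c, first_dose))
```

Treating a same-day onset as "following" the intervention is a simplification; a real algorithm would need time-of-day information or an explicit convention.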

5. SDTM needs a single approach to describe planned medical interventions. 

This includes a link to the medical condition that triggers the intervention, the reason for the intervention (treat, cure, mitigate, diagnose, prevent), the type of intervention (drug administration, device, surgery, etc.) and a link to the planned clinical observations to assess the effect of the intervention on the medical condition. 
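
A planned intervention record along these lines might be sketched as follows. The `PlannedIntervention` class, the `Reason` enumeration, and all identifiers are hypothetical names chosen for illustration; only the list of reasons and the kinds of links come from the paragraph above.

```python
from dataclasses import dataclass, field
from enum import Enum

class Reason(Enum):
    """Reasons for an intervention, as listed in the text."""
    TREAT = "treat"
    CURE = "cure"
    MITIGATE = "mitigate"
    DIAGNOSE = "diagnose"
    PREVENT = "prevent"

@dataclass
class PlannedIntervention:
    """Hypothetical planned intervention record."""
    intervention_id: str
    condition_id: str          # link to the triggering medical condition
    reason: Reason
    intervention_type: str     # e.g. drug administration, device, surgery
    planned_observation_ids: list[str] = field(default_factory=list)

# Illustrative record; every id is made up.
plan = PlannedIntervention(
    intervention_id="PI-1",
    condition_id="MC-0001",
    reason=Reason.TREAT,
    intervention_type="drug administration",
    planned_observation_ids=["O1", "O3"],
)

if __name__ == "__main__":
    print(plan.reason.value, plan.condition_id, plan.planned_observation_ids)
```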

Following its execution, the outcome of the care plan is documented as additional clinical observations that lead to additional assessments in a continuous loop. The cycle ends according to stopping rules. Reasons for stopping include [1] death, [2] resolution of the medical condition, and [3] an arbitrary, protocol-specified endpoint.
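
The three stopping rules can be written as a small check that the observe, assess, plan loop would run after each cycle. The function name, its parameters, and the order in which the rules are tested are assumptions made for this sketch.

```python
from typing import Optional

def stop_reason(subject_died: bool,
                condition_resolved: bool,
                protocol_end_reached: bool) -> Optional[str]:
    """Return why the cycle stops, or None if it should continue.

    The precedence (death, then resolution, then protocol) is an
    assumption of this sketch, not something the model prescribes.
    """
    if subject_died:
        return "death"
    if condition_resolved:
        return "resolution of the medical condition"
    if protocol_end_reached:
        return "protocol-specified"
    return None

if __name__ == "__main__":
    print(stop_reason(False, False, False))
    print(stop_reason(False, True, True))
```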

Some of these suggestions constitute major changes. But by organizing the data more closely with the way they are generated and used in clinical practice, the data become more useful and the ability to incorporate new clinical data content is improved without needing frequent upgrades to the model. Adding new clinical content should be as simple as adding new controlled terms to a dictionary. This is critically needed as clinical data content requirements continue to expand with no end in sight. 

I appreciate your comments.  


  1. Armando.
    Interesting post. What I see in your logic is reinforcement to the argument of splitting the data from the presentation and trying to get there sooner rather than persisting with SDTM for exchange and presentation (analysis).

    1. Yes, Dave. I agree with you. It's ironic that your latest post occurred on the same day as mine as there is considerable overlap in thinking. I think my suggestions reinforce your view of splitting the data from the presentation of the data.

  2. Thank you for a great post, Armando. Lots of things to digest. Just wanted to say that I think your no. 1 proposal is maybe the most important one: "SDTM should incorporate unique identifiers for each record in each domain".

  3. Unique identifiers in SDTM are there: the define.xml 2.0 mandates sponsors to provide the primary key (as a combination of variables) for each dataset (KeySequence attributes). However, it looks as if these keys are not used, either by the FDA or by validation tools (the validation tool the FDA is using even ignores the define.xml). Technically, we can easily provide a better mechanism for referencing using these keys, but there seems to be no will to do so at this moment, either from the FDA or from the SDTM team.

    1. Question: Does SDTM support unique identifiers for each record or only at the dataset level? This is not clear in the implementation guide.

    2. Armando - your great blog post inspired me to write one as well, building on the good discussion here and on LinkedIn

    3. Link to my blog post http://kerfors.blogspot.se/2016/05/global-persistent-and-resolvable.html

  4. Dear Armando,
    You are right: that is not clear from the IG. If the keys are chosen well, the uniqueness should be at the record level. A better source may be the define.xml 2.0 specification. Good keys are e.g. STUDYID + DOMAIN + USUBJID + CAT + TESTCD + DTC + VISITNUM (but that might be overkill...). However, they seem not to be really used by anyone. The validation tool the FDA is using does not seem to read them from the define.xml at all for validating record uniqueness.