The ability of different information technology systems and software applications to communicate, exchange data, and use the information that has been exchanged.
In an excellent article, CN Mead succinctly defines computable semantic interoperability (CSI) as unambiguous data exchange.
From my perspective, there seem to be two problems leading to an insufficient degree of CSI.
The first problem is the use of the same name for different concepts. An example would be the name "ventricle," which can mean a heart ventricle or a brain ventricle. Another is the word "drug," which can mean a good thing (e.g. a medication) in one context and a bad thing (an illicit drug) in another. In a previous post, I discussed how the CDISC SDTM Reference Start Date (RFSTDTC) can mean different things in different studies because the definition is not sufficiently precise.
The second problem is the use of multiple names for the same concept. An example would be stroke, cerebrovascular accident, and cerebral infarction. Controlled terminologies exist to address these problems.
The reality is we are never going to get away from these two problems. The world is much too large and too diverse to impose the same data standards across a domain as complex as health care. Yet, the "disambiguation" of concepts and their names (or "labels") must occur before meaningful pooling and analysis of data across multiple sources can happen (i.e. before the data are truly computable). One always wants to compare "apples to apples" in any analysis, especially those with public health implications. The current state is that this disambiguation occurs manually, resulting in (no surprise) slowness, inefficiencies, and inconsistencies or errors.
How can computers help? If data were expressed using the Resource Description Framework (RDF), some of the disambiguation could be automated. Here's how. Every resource (e.g. a concept) has a Uniform Resource Identifier (URI), similar to a URL on the web. In the same way that the Google home page cannot be confused with the Yahoo! home page (because they have different URLs), systems can distinguish between a ventricle in the heart and a ventricle in the brain because they each have a different URI. Two resources with the same name/label are treated as different if their URIs are different.
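As a rough illustration of this point, here is a minimal Python sketch that models RDF statements as plain (subject, predicate, object) tuples rather than using an RDF library. The example.org URIs and the partOf predicate are made up for illustration; only the rdfs:label URI is a real one. It shows one label resolving to two distinct resources.

```python
# Hypothetical URIs under example.org; real data would use URIs
# from a published vocabulary or terminology.
HEART_VENTRICLE = "http://example.org/anatomy/heart-ventricle"
BRAIN_VENTRICLE = "http://example.org/anatomy/brain-ventricle"
PART_OF = "http://example.org/anatomy/partOf"  # hypothetical predicate
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"  # rdfs:label

# Each statement is a (subject, predicate, object) triple.
triples = {
    (HEART_VENTRICLE, LABEL, "ventricle"),
    (HEART_VENTRICLE, PART_OF, "http://example.org/anatomy/heart"),
    (BRAIN_VENTRICLE, LABEL, "ventricle"),
    (BRAIN_VENTRICLE, PART_OF, "http://example.org/anatomy/brain"),
}


def resources_with_label(graph, label):
    """Return the sorted URIs of every resource that carries the given label."""
    return sorted(s for (s, p, o) in graph if p == LABEL and o == label)


matches = resources_with_label(triples, "ventricle")
for uri in matches:
    print(uri)
```

The label "ventricle" is ambiguous, but a system that keys on the URI rather than the label never mixes the two concepts.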
Problem #2 is a little harder to address. Here we get into a little bit of philosophy on what it means for two things to be the same. This discussion gets very complicated very quickly. Most will agree that Mary Alice Smith and M.A. Smith are the same individual if they have the same parents and were born on the exact same date and time in the same hospital (assuming here that twins are typically born a few minutes apart). But is Mary Alice Smith, the 10-year-old, the same as Mary Alice Smith, the 30-year-old? People often say, "I am not the same person now that I was when I was 21."
From a practical, I would say computational, perspective, two things might be considered the same if they share the same properties. In RDF, properties are described using predicates and objects. So M.A. Smith may have a predicate called ":birthdate" with object (value) of 1980-01-26T12:15:33. In the semantic web, one can search and identify all predicates and objects for a resource and compare them to the predicates and objects of another resource and see if they match.
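The comparison described above might be sketched as follows. This is only an illustration in plain Python, again using tuples as stand-ins for RDF triples; the identifiers (":person1", ":person2") and the predicates other than ":birthdate" are hypothetical.

```python
# Hypothetical resources and predicates, written as short names for readability.
triples = {
    (":person1", ":name", "Mary Alice Smith"),
    (":person1", ":birthdate", "1980-01-26T12:15:33"),
    (":person1", ":birthplace", ":hospitalA"),
    (":person2", ":name", "M.A. Smith"),
    (":person2", ":birthdate", "1980-01-26T12:15:33"),
    (":person2", ":birthplace", ":hospitalA"),
}


def properties(graph, subject):
    """Collect every (predicate, object) pair stated about a subject."""
    return {(p, o) for (s, p, o) in graph if s == subject}


def compare(graph, a, b):
    """Return the properties two resources share and those on which they differ."""
    props_a = properties(graph, a)
    props_b = properties(graph, b)
    return props_a & props_b, props_a ^ props_b


shared, differing = compare(triples, ":person1", ":person2")
print("shared:", sorted(shared))
print("differing:", sorted(differing))
```

Here the two resources agree on birth date and birthplace and differ only in how the name is written, which is the kind of evidence a person would weigh by hand.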
When you think about it, this is what we do manually. We inspect the two "things" and compare their properties. If enough of the important properties match, then for all practical purposes they are the same. If properties don't match, it raises the question "are they different?" With time, additional, more important properties might be identified that force us to change our conclusion and consider two similar things as different. Determining "sameness" is not an exact science, as it turns out.
Consider breast cancer. There was a time when certain types of breast cancers were considered the same disease and treated the same way. Along came more sophisticated testing techniques, starting with estrogen receptor testing, and lo and behold, those tumors that were estrogen receptor positive behaved differently when exposed to the estrogen blocker tamoxifen. What was previously considered the same is now different. This happens all the time in medicine as new medical knowledge emerges. We used to think all cholesterol was the same, until LDL-cholesterol and HDL-cholesterol were discovered. It now appears that Parkinson's Disease is likely a collection of different conditions, each behaving differently and having different treatment responses.
The bottom line is that expressing the properties of resources (e.g. things) in RDF allows information systems to identify and compare the properties of two resources and make a reasonable (but not perfect) determination of sameness. This accepts that we never have complete information about anything. But what was previously a manual process can now be automated. As we learn more about a resource, more properties emerge and can be documented in RDF, and determinations of sameness become more accurate.
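One way to picture "enough of the important properties match" is a weighted score compared against a cutoff. The sketch below is only an assumption about how that could be done: the property names, the weights, and the 0.8 threshold are invented for illustration and carry no clinical meaning.

```python
def sameness_score(props_a, props_b, weights):
    """Weighted fraction of properties on which two resources agree.

    Only properties recorded for BOTH resources are compared, since
    information about either one may be incomplete.
    """
    common = set(props_a) & set(props_b)
    total = sum(weights.get(p, 1.0) for p in common)
    if total == 0:
        return 0.0
    agreeing = sum(weights.get(p, 1.0) for p in common if props_a[p] == props_b[p])
    return agreeing / total


THRESHOLD = 0.8  # assumed cutoff for "the same for practical purposes"

# Hypothetical weights: receptor status counts more than site or histology.
weights = {":site": 1.0, ":histology": 1.0, ":estrogenReceptor": 3.0}

tumor_a = {":site": "breast", ":histology": "ductal carcinoma"}
tumor_b = {":site": "breast", ":histology": "ductal carcinoma"}

# With only the older properties known, the two tumors look identical.
before = sameness_score(tumor_a, tumor_b, weights)

# A newly documented property changes the picture.
tumor_a[":estrogenReceptor"] = "positive"
tumor_b[":estrogenReceptor"] = "negative"
after = sameness_score(tumor_a, tumor_b, weights)

print(before >= THRESHOLD, after >= THRESHOLD)
```

The data about site and histology never changed; adding one heavily weighted property was enough to move the pair from "same" to "different."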
This is huge from an interoperability standpoint.