Guideline documents are edited using GEM Cutter to form GEM Documents. These are then uploaded to a repository using a program that extracts the essential elements. The next step is to run Apache cTAKES, a UIMA-based NLP processor for clinical documents, which creates annotations for the guideline text. We adopt the YTEX version so that the results are stored in a relational database. The final step is to create SVM classifiers based on training sets created by clinical experts. See Action Types below.
An important initiative that we are pursuing is the automated association of Concept Unique Identifiers (CUIs) and ICD-10 codes with recommendation text. The challenge is to optimize the set of codes related to the recommendation text. Using a RESTful interface to the UMLS, a single concept submitted to the search engine can produce a wide range of returned CUIs (false positives). If we submit bigrams and trigrams, we receive a different but related set of codes. We have utilized similarity measures to reduce the result set while limiting the loss of information (false negatives). Currently, we employ YTEX to produce a canonical set of concepts and UMLS codes, methods to generate n-grams for search terms, and similarity measures (cosine and Sorensen-Dice coefficient) to reduce the matches generated from search results.
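To make the filtering step concrete, here is a minimal sketch of the two similarity measures mentioned above, applied to whittling down candidate concept names returned by a search. The `filter_candidates` helper and its threshold value are our own illustrative choices, not part of the production pipeline:

```python
from collections import Counter
from math import sqrt

def dice(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over word-token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words term frequencies."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_candidates(query: str, candidates: list, threshold: float = 0.3) -> list:
    """Keep only candidate concept names sufficiently similar to the query."""
    return [c for c in candidates if dice(query, c) >= threshold]
```

A dissimilar candidate such as a shoulder-imaging concept scores near zero against "thyroid disease" and is dropped, while a near-match like "thyroid disease (disorder)" survives.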
Another way to approach the problem is to keep only the top returned result and then determine how well it matches the text. For example, if we send the common deontic term "should" to the UMLS, the first returned code is Concept: [C0882088] Multisection:Finding:Point in time:Shoulder:Narrative:MRI, which is not a code that we want to associate with the recommendation. The challenge, then, is to determine whether this one result is sufficient to describe the submitted term. Filtering out terms that do not contribute to the meaning of the content will also improve results.
N-grams are useful in that they allow for the calculation of probabilities, since we can convert frequency counts to probabilities after normalizing. An expression with a high probability of occurring that is then found in some unidentified text is a strong indicator that the expression is related to that text. Combined with other similarity measures, we hope to form an optimized system for identifying recommendations related to a clinical task or action.
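The count-to-probability conversion is straightforward; a minimal sketch (the function names are our own):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_probabilities(corpus_tokens, n=2):
    """Normalize raw n-gram frequency counts into probabilities
    (each count divided by the total number of n-grams)."""
    counts = Counter(ngrams(corpus_tokens, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}
```

For a four-token text there are three bigrams, so each unique bigram gets probability 1/3 and the distribution sums to 1.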
Action Types have been used to create a classifier. The goal is to identify recommendations using these categories, which will improve the identification of recommendations related to specific clinical decision support tasks. Action Types are based on work by Essaihi et al. and are used explicitly in Bridge-Wiz to help guideline developers choose actions from a controlled vocabulary when developing guideline recommendations. More detailed information on the implementation of Bridge-Wiz is available at Glides. Bridge-Wiz has also been adapted as a web application.
The classifier we tried first was a Naive Bayes classifier. (See below for a better approach, FastText.) It enables users to categorize recommendations based on Action Type.
The Action Types are:
- Gather Data
- Draw Conclusion
- Perform Therapeutic Procedure
The benefit of classifying with Action Types is that it gives a clear indication of where in the course of a clinical encounter the recommendation is relevant. From there, we may further classify recommendations based on a more clinically specific vector space model, such as organ or disease type. The approach we take is an iterative one, so that we build a clear picture of a recommendation without the need to rely on manual development.
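As an illustration of the kind of classifier involved, here is a from-scratch multinomial Naive Bayes sketch with Laplace smoothing. The training examples and labels in the usage below are invented for demonstration; they are not our experts' data set:

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    """Multinomial Naive Bayes over word tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        n = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.class_counts:
            # log prior plus smoothed log likelihood of each token
            lp = log(self.class_counts[label] / n)
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                lp += log((self.word_counts[label][t] + 1) / total)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Hypothetical miniature training set with two Action Types:
clf = NaiveBayes().fit(
    ["order a thyroid function test", "prescribe levothyroxine daily",
     "order a complete blood count test", "prescribe aspirin daily"],
    ["Test", "Prescribe", "Test", "Prescribe"],
)
```

With so few examples, a shared word like "daily" or "receive" can easily tip a prediction into the wrong class, which mirrors the confusions described below.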
Our work with Action Types demonstrated some interesting results. Common terms, like "receive," are used in different settings and can throw the model off, turning a Procedure into a Prescribe action type. Clearly, Naive Bayes will need more training examples to overcome this hurdle. Also, a phrase such as "A focused exam of the hair" is asserted to be a Monitor rather than a Test activity, which is somewhat understandable. Another example of interest is this recommendation: "The presence of maternal thyroid disease is important information for the pediatrician to have at the time of delivery." The experts labeled it a Test, but the model said it was Educate/Counsel at 72% confidence. There is an inherent inference that this model cannot make.
We have furthered this activity by implementing the FastText classifier. FastText is a C++ program that we compiled for our Linux platform and ran from a Java program using our experts' Action Types data set. The results are displayed in the AI section for comparison with our Naive Bayes approach. We have not implemented multi-labels, although FastText has this feature; this was the primary source of error. We used n-grams of size 2, a learning rate of 1, and 30 epochs. The results are very encouraging. If you wish to try it out, you can use this data set, where we have added a "__label__" tag to work with FastText.
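The "__label__" convention is just a per-line prefix in the training file. A minimal sketch of preparing such a file (helper names are ours, not from our pipeline):

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Format one training example in FastText supervised format:
    a '__label__' prefix followed by the example text on one line."""
    clean = " ".join(text.split())  # collapse newlines and extra whitespace
    return f"__label__{label.replace(' ', '_')} {clean}"

def write_training_file(examples, path):
    """examples: iterable of (action_type, recommendation_text) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in examples:
            f.write(to_fasttext_line(label, text) + "\n")

# Training itself then uses the FastText CLI with the parameters above:
#   fasttext supervised -input train.txt -output model \
#       -wordNgrams 2 -lr 1.0 -epoch 30
```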
Acronyms are always a stumbling block for NLP and our capturing of these in a reference database table should help in reducing these errors.
Our next activity will be to examine ways to utilize n-grams to identify recommendations that are applicable in a clinical setting. Based on the frequency results, we will convert these to probabilities and conditional probabilities. We can then assign a likelihood that a given recommendation will be relevant to a clinical context source, such as an order set or EHR clinical note. We are also going to see how to build on our Fasttext results.
We have been reviewing the work of Wessam Gad El-Rab, “Clinical Practice Guideline Formalization: Translating Clinical Practice Guidelines to Computer Interpretable Guidelines,” and have considered the usage of the Action Palette in that work. Given our excellent Fasttext results, we wish to pursue the idea of whether these transitive verbs can be used to produce an even stronger mechanism to categorize recommendations. Are the verbs consistent with our Fasttext results? Are they necessary and/or sufficient? We plan to explore this connection and review its impact.
Here are our results for the current set of recommendations:
The Action Palette is a set of verbs for each recommendation action type. For each action type, how many of these verbs agree with the FastText action types?
Total Recommendations = 949
Number of recommendations that matched a verb from the Action Palette = 882
Number of recommendations that did not have a verb appear in it = 67
Certainly, this is a promising result. Given the number of recommendations that fail executability criteria, one would expect missing transitive verbs in at least 10% of recommendations if not more. Also, not all the recommendations under consideration are actionable.
So, the action palette should be a useful addition to our ability to identify pertinent recommendations at the point of care.
Using this set of verbs, we have produced a mechanism to predict action type in an automated fashion. We will combine this with FastText to produce a mechanism for automated identification of action type. One byproduct of this process is the identification of under-specified recommendations. Recommendations that are not clearly executable are not easily implemented in an EHR. They may represent the state of clinical knowledge and provide clinical advice, but they do not clearly present a way to perform an activity. Perhaps implementation did not occur to the developers, or they only intended to clarify a position on a particular topic; either way, these recommendations do not rise to the level of rigor required for this type of work, since we cannot recognize the intended action.
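A minimal sketch of the verb-matching idea follows. The palette entries here are invented stand-ins, not the actual Action Palette verbs; a recommendation with no recognizable action verb falls through as under-specified:

```python
# Illustrative palette only: a few plausible verbs per action type.
ACTION_PALETTE = {
    "Gather Data": {"obtain", "measure", "assess", "record"},
    "Draw Conclusion": {"diagnose", "classify", "confirm"},
    "Perform Therapeutic Procedure": {"administer", "perform", "inject"},
}

def predict_action_type(recommendation: str):
    """Return (action_type, matched_verb) for the first palette verb found,
    or (None, None) when no palette verb appears, flagging the
    recommendation as potentially under-specified."""
    for token in recommendation.lower().split():
        word = token.strip(".,;:()")
        for action_type, verbs in ACTION_PALETTE.items():
            if word in verbs:
                return action_type, word
    return None, None
```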
We have discovered that our preliminary analysis of the Action Palette ignored a flaw in the model. There are numerous overlapping verbs for different action types. This distribution increased the probability of a match but meant that the match was not necessarily supporting the correct outcome: in NLP terms, too many false positives. We will need to improve the verbs selected for each action type so that the categories do not overlap. It is also possible that combining our work with SNOMED codes could produce a better support mechanism.
Looking at the complete guideline as a source rather than a recommendation, we have begun an implementation of the fasttext algorithm, this time using the categories from the Guideline Central Summaries. A first pass will utilize 20 guidelines per category so that the distribution will be equivalent. Our goal is to correctly predict the category of subsequent guidelines.
A preliminary (but time-consuming) production and assessment of results for categorizing guidelines has not been promising. Cross-categorization initially led to dispersion in the end results. We removed these overlapping guidelines and rebuilt the model with only uniquely categorized guidelines. We also utilized the complete set of summaries. Unfortunately, the results remained poorly correlated. A negative result, but lessons learned.
Since our initial results were obtained with the raw guidelines, performing NLP processing on the text might lead to better outcomes; we could apply cTAKES to the guideline summary text before categorizing. Another approach to automating the categorization process might be to utilize action types in assigning guidelines to guideline categories: after first assigning action types to guideline categories, we could determine the action types in a guideline's recommendations and take the most frequent match as the correct category, although overlap might be a concern.
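The most-frequent-match heuristic just described can be sketched as a majority vote. The mapping from action types to guideline categories is the hypothetical piece that would have to be defined first:

```python
from collections import Counter

def guess_category(rec_action_types, category_of_action_type):
    """Assign a guideline to a category by taking the most frequent
    action type among its recommendations and mapping it to a category.
    `category_of_action_type` is the (yet to be defined) mapping from
    action types to guideline categories."""
    if not rec_action_types:
        return None
    most_common, _ = Counter(rec_action_types).most_common(1)[0]
    return category_of_action_type.get(most_common)
```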
Another line of inquiry that may prove useful is whether the relevance of a recommendation can better be explored through analysis of the recommendation text relative to the guideline text. Based on a vector representation of each recommendation in a guideline, what role does similarity play? If recommendation words represent a separate space from the L words of the complete Summary, should they have a separate recommendation basis in order to accurately compare the recommendations or do they need to be in a space that includes the L words? Is a recommendation similar to the other recommendations in the guideline or is it dissimilar? Is it indirectly related or is it central to the guideline text? How can we better rank the recommendations based on this assessment? Do recommendation strength and quality of evidence support this ranking?
To further the above development, we need to be able to link recommendation text to CPT codes for clinical procedures. A recommendation may have a set of assigned CUIs, obtained as described above, so we want to correlate those with CPT codes and determine the best one. The UMLS Metathesaurus also provides a crosswalk, which we can utilize to report the results.
We added MeSH codes, SNOMED codes, and CPT codes, along with our CUI codes. CPT codes are definitely lacking in number. We utilized the BioPortal RESTful service to capture the codes from text.
We have added a description of cTAKES and how it is used in our development work. Key takeaways are that it is based on a pipeline and that it is especially adapted to clinical NLP. It is only the starting point for machine learning, but it is an important first step.
In looking at the recommendation in the context of a guideline, it appears useful to capture the relations between the two. While the recommendation contains the significant advice from the guideline developers, the guideline contains essential supporting materials. Does the similarity tell us more about the significance of each recommendation? If a recommendation is not fully described, perhaps it will not be implemented as intended. [Update: We have generated a table of the above comparison for a guideline and its recommendations. See Similarity – Rec – Guideline page. 11/29/19]
Recommendation Strength is a critical element for establishing the degree to which the recommended action should be followed. We have created a training set that contains a label of recommendation strength (S for Strong or W for Weak) along with the recommendation text. The goal is to judge whether the language used in a given recommendation is consistent with the corpus of recommendations and, if not, to identify recommendations that may not convey the intended meaning of the developer.
Our work on Recommendation Strength was not conclusive, and so we have moved on from it. An interesting concept that has come up is Lexical Density: how do recommendations rank as defined by the number of clinical terms over the total number of terms? This measure could be evaluated as the ratio of the number of terms that return a code set from the UMLS to the total number of terms in the text of interest, either recommendation or guideline. If the ratio is high, then the text is lexically dense. Does this impact the message? Does the text convey a clear and unambiguous message? Is it appropriate for the intended audience?
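The ratio itself is simple to compute once a clinical-term check is available. In this sketch, `is_clinical` stands in for the real test (whether the UMLS returns a code set for the term), and the vocabulary in the usage line is an invented stand-in:

```python
def lexical_density(tokens, is_clinical):
    """Ratio of clinical terms to total terms in a token list.
    `is_clinical` is any predicate; in practice it would query the UMLS
    and report whether the term returned a code set."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_clinical(t)) / len(tokens)
```

For example, with a toy clinical vocabulary of {"thyroid", "disease"}, the text "thyroid disease is common" has a density of 0.5.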
Are Choosing Wisely recommendations appropriate to implement without a complete available context, which is established by relevant metadata, such as the intended audience/user or target audience/population, typically contained in the complete guideline or guideline summary? Should one implement a recommendation with only a partial or incomplete context? In a recent article [Zadro, J.R., Farey, J., Harris, I.A. et al. Do Choosing Wisely recommendations about low-value care target income-generating treatments provided by members? A content analysis of 1293 recommendations. BMC Health Serv Res 19, 707 (2019). doi:10.1186/s12913-019-4576-1], the authors provide us with a corpus of recommendations for which they have determined a number of properties: test or treatment, income generating or not, for or against, qualified or not, and member focused. Our interest is in using this corpus as a gold standard for each of these properties. Choosing Wisely recommendations are not typical guideline recommendations, but it is worth considering the machine learning models we can build with this data and then extending the properties to our corpus. We have created a model using test or treatment and have produced excellent results. The authors' focus is on income-generating recommendations, and it is not clear we can determine this from the text, but we will see if it is possible. For or against is fairly trivial, so we should have a good model for that property, as well as for the rest.
Another interesting question we wish to pursue is whether the similarity (such as cosine similarity) of recommendations can be used to rank them within a guideline. That is, are highly similar recommendations more important than recommendations that are less similar? It would appear to be true prima facie, but we wish to verify it and then compare the result with the actual ordering in the guideline. We expect that being able to sort recommendations by degree of relevance will be a useful tool.
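One way to operationalize this is to score each recommendation by its mean cosine similarity to the other recommendations in the guideline and sort on that score. The sketch below uses plain bag-of-words vectors; our actual vector space model may differ:

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cos(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_centrality(recommendations):
    """Order recommendations by mean cosine similarity to the others,
    most central (highest average similarity) first."""
    vecs = [bow(r) for r in recommendations]
    def mean_sim(i):
        others = [cos(vecs[i], vecs[j]) for j in range(len(vecs)) if j != i]
        return sum(others) / len(others) if others else 0.0
    order = sorted(range(len(recommendations)), key=mean_sim, reverse=True)
    return [recommendations[i] for i in order]
```

A recommendation sharing no vocabulary with the rest of the guideline ends up ranked last, which is exactly the "indirectly related" case we want to surface.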
One aspect of machine learning that is ubiquitous is probability theory. As we develop more complex programs, it seems natural to turn to a programming language built for this task. We are investigating Gen, which is written in the Julia programming language. Julia is a very powerful yet user-friendly language. It builds on the past successes of other languages, which makes it ideal for our purposes. Although Gen is pre-alpha and may not prove to be the future we seek, Julia certainly has the potential to be an important part of it, and efforts to implement it will surely be worth pursuing.
Pointwise Mutual Information (log base 2) was calculated from our unigram and bigram tables. The results indicate that "should" is too ubiquitous to be meaningfully correlated. The PMI was computed with P(x, y) estimated from the frequency of the bigram and P(x) and P(y) estimated from the unigram table. Our corpus is the set of conditional recommendation terms (1682 tokens). A more extensive set of PMIs could be generated if we parsed the raw text, since our n-gram table is highly filtered.
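The calculation is compact enough to sketch directly; the toy counts in the test are invented, not drawn from our actual tables:

```python
from math import log2

def pmi(bigram, unigram_counts, bigram_counts):
    """Pointwise Mutual Information, log base 2:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ),
    with probabilities estimated from frequency counts."""
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())
    x, y = bigram
    p_xy = bigram_counts[bigram] / n_bi
    p_x = unigram_counts[x] / n_uni
    p_y = unigram_counts[y] / n_uni
    return log2(p_xy / (p_x * p_y))
```

A rare pair whose parts almost always occur together (like "infantile spasms") scores well above a pair built from ubiquitous words, matching the pattern described below.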
Our PMI results are very useful. We can clearly see the nondeterministic nature of some pairs: for example, "should" occurs so many times that it becomes less correlated with any specific joining term, whereas high PMI values are associated with terms that do not occur on their own very frequently, such as "infantile spasms," at least for our specific corpus of guideline recommendations. The significance of these results needs to be investigated further.
With our preliminary results, we expanded the PMI set to the whole corpus and have some very interesting results. High PMI values indicate more clinically relevant terms and less pedestrian, although clinical in nature, terms. Pairs which include terms such as “should” and “therapy” score lower PMI values, since they occur in locations outside the current pair being considered. Sorting of the table also allows one to see the progression of terms as they become more or less “clinical.”
Digitizing clinical guideline recommendations involves the identification of Action Type, Action(s), Condition(s), Logic, Recommendation Strength, and Evidence Quality. Further, successfully implementing recommendations requires that they be executable and decidable. We also need to identify each recommendation's effect on the process of care, its degree of novelty, and finally its measurability. Each of these dimensions is described in detail in the GuideLine Implementability Appraisal instrument developed at the Yale Center for Medical Informatics. These dimensions are important because they answer the questions implementers ask when implementing a clinical guideline recommendation. Questions such as "How does this recommendation impact the users?", "How does it impact my patients?", and "Can we measure its use?" are all critical to a successful implementation strategy. We are going to create an interface that will allow us to capture this critical metadata and also support it with some of the enhancements we have been building.