Process Flow

Guideline documents are edited using GEM Cutter to form GEM Documents. These are then uploaded to a repository by a program that extracts the essential elements. The next step is to run Apache cTAKES, a UIMA-based NLP processor for clinical documents, which creates annotations for the guideline text. These annotations, which include SNOMED codes, are then stored in the repository for later retrieval. The final step is to create support vector machine (SVM) classifiers from training sets created by clinical experts. Examples are being designed and developed and will be made available as soon as they are ready.

An important initiative that we are pursuing is the automated association of concept unique identifiers (CUIs) and ICD-10 codes with recommendation text. The challenge is to optimize the set of codes that are related to the recommendation text. A single word or token can return a wide range of CUIs. If we submit bi-grams and tri-grams, we receive a different but related set of codes. Currently we employ cTAKES (to produce an annotated recommendation), n-grams, and similarity measures (cosine similarity and the Sørensen-Dice coefficient) to reduce the matches generated from search results. The similarity measures help reduce the result set without loss of information, and we are seeking an automated strategy to optimize it further.
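The reduction step described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a list of candidate (code, description) pairs already returned by a search; the function names, the threshold value, and the placeholder codes are hypothetical and are not taken from the project's implementation.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n):
    """Return the word n-grams (as tuples) of lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram lists treated as frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def dice_coefficient(a, b):
    """Sørensen-Dice coefficient between two n-gram lists treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def filter_candidates(recommendation, candidates, n=1, threshold=0.3):
    """Keep candidates whose description is similar enough to the recommendation.

    `candidates` is a list of (code, description) pairs; the threshold is an
    assumed tuning parameter, not a value from the project.
    """
    rec = ngrams(recommendation, n)
    kept = []
    for code, description in candidates:
        score = dice_coefficient(rec, ngrams(description, n))
        if score >= threshold:
            kept.append((code, description, score))
    # Highest-scoring candidates first.
    return sorted(kept, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Placeholder codes for illustration only; these are not real CUIs.
    candidates = [("CODE-A", "chest pain"), ("CODE-B", "broken leg")]
    for code, description, score in filter_candidates(
        "assess chest pain in adults", candidates
    ):
        print(code, description, round(score, 3))
```

Dice works on sets and ignores repeated terms, while cosine takes term frequency into account, so the two measures can rank the same candidates differently. Which one to threshold on, and at what value, is a tuning decision.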

N-grams are also important because they allow probabilities to be calculated: frequency counts can be converted to probabilities by normalizing them. When an expression that has a high probability of occurring is then found in some unidentified text, it is a strong indicator that the unidentified text is related to the text in question. Combined with the use of other similarity measures, we aim to form an optimized system for identifying recommendations that are related to a clinical task.
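As a sketch of the normalization step, the probability of an n-gram g can be estimated as count(g) divided by the total number of n-grams observed. The example below assumes whitespace tokenization and a small list of known texts; the function names and the averaging score are illustrative assumptions, not the project's scoring method.

```python
from collections import Counter


def ngram_probabilities(texts, n=2):
    """Estimate n-gram probabilities as relative frequencies over the given texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    # Normalize: each count divided by the total number of n-grams seen.
    return {gram: c / total for gram, c in counts.items()} if total else {}


def score_text(text, probs, n=2):
    """Average probability of the text's n-grams; unseen n-grams contribute 0."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(probs.get(g, 0.0) for g in grams) / len(grams)


if __name__ == "__main__":
    known = ["chest pain", "chest pain management"]
    probs = ngram_probabilities(known, n=2)
    print(score_text("chest pain", probs))
    print(score_text("broken leg", probs))
```

A text sharing high-probability n-grams with the known texts scores higher than one that shares none. A production system would likely add smoothing for unseen n-grams, which this sketch omits.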