Concept Matching

The concept extraction pipeline maps concepts in narrative text to the UMLS Metathesaurus. The pipeline has four components, applied in order: tokenization, lexical normalization, UMLS Metathesaurus lookup, and concept screening.

First, the tokenization component, which was adapted from the OpenNLP suite to handle biomedical text, splits the query into tokens.
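As a rough illustration of this step, the sketch below splits a query with a regular expression. It is a hypothetical stand-in, not the adapted OpenNLP tokenizer; the pattern and the name `tokenize` are assumptions made for this example.

```python
import re

# Hypothetical stand-in for the adapted OpenNLP tokenizer (an assumption):
# alphanumeric runs, optionally joined by internal hyphens, stay together;
# any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(query):
    """Split a free-text query into a list of tokens."""
    return TOKEN_PATTERN.findall(query)


tokens = tokenize("Beta-blockers for type 2 diabetes, hypertension")
```

Keeping hyphenated words whole is one possible design choice for biomedical text, where hyphenated drug and disease names are common.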

Second, lexical normalization (converting words to a canonical form) is performed using an efficient in-memory data structure similar to a hash table. The Lexical Variant Generation (LVG) terminology from the UMLS Metathesaurus is compressed by (i) converting the terms to lowercase; (ii) removing the terms where the normalized word has more than one token; and (iii) removing the terms that have the same base form.
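The compression and lookup can be sketched with a plain Python dictionary. This is a minimal sketch under stated assumptions: the input is taken to be (variant, base form) pairs, which is a hypothetical format and not the actual LVG file layout, and step (iii) is read as dropping entries whose term is already identical to its base form, since the lookup can fall back to the lowercased token in that case.

```python
def build_normalization_table(lvg_pairs):
    """Build a variant -> base-form table from (variant, base) pairs.

    The pair format is an assumption made for this sketch.
    """
    table = {}
    for variant, base in lvg_pairs:
        # (i) convert terms to lowercase
        variant, base = variant.lower(), base.lower()
        # (ii) drop entries whose normalized form has more than one token
        if len(base.split()) > 1:
            continue
        # (iii) assumed reading: drop entries that equal their base form
        if variant == base:
            continue
        table[variant] = base
    return table


def normalize(tokens, table):
    """Map each token to its base form, falling back to the lowercased token."""
    return [table.get(tok.lower(), tok.lower()) for tok in tokens]


pairs = [
    ("Tumours", "tumor"),
    ("tumor", "tumor"),
    ("NSAIDs", "non steroidal anti inflammatory drug"),
    ("kidneys", "kidney"),
]
table = build_normalization_table(pairs)
normalized = normalize(["Kidneys", "Tumours", "tumor"], table)
```

Under this reading, the compressed table stores only entries that change a token, which keeps the in-memory structure small.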

Third, a UMLS Metathesaurus lookup is performed using the Aho-Corasick string-matching algorithm, a well-known and efficient method that finds all occurrences of a set of terms in a single pass over the input. Our implementation of the Aho-Corasick algorithm loads the normalized tokens and their substrings as the individual states of the corresponding finite state machine. The transitions between the different states represent the different terms formed by the original tokens.
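To make the lookup concrete, here is a minimal sketch of Aho-Corasick matching. It assumes transitions over whole normalized tokens rather than characters, which is a simplification and may differ from the implementation described above. The class name, method names, and concept identifiers are illustrative placeholders, not real UMLS CUIs.

```python
from collections import deque


class AhoCorasick:
    """Token-level Aho-Corasick automaton (illustrative sketch)."""

    def __init__(self):
        self.goto = [{}]    # goto[state][token] -> next state
        self.fail = [0]     # failure link for each state
        self.output = [[]]  # (term length, concept id) pairs ending at state

    def add(self, term_tokens, concept_id):
        """Insert one term, given as a list of normalized tokens."""
        state = 0
        for tok in term_tokens:
            nxt = self.goto[state].get(tok)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[state][tok] = nxt
            state = nxt
        self.output[state].append((len(term_tokens), concept_id))

    def build(self):
        """Compute failure links breadth-first; call after all terms are added."""
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for tok, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and tok not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(tok, 0)
                # Inherit terms that end at the failure state (suffix matches).
                self.output[nxt] += self.output[self.fail[nxt]]

    def search(self, tokens):
        """Return (start, end, concept id) for every term found in tokens."""
        state, matches = 0, []
        for i, tok in enumerate(tokens):
            while state and tok not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(tok, 0)
            for length, concept_id in self.output[state]:
                matches.append((i - length + 1, i + 1, concept_id))
        return matches


ac = AhoCorasick()
ac.add(["heart", "failure"], "C_HF")                 # placeholder ids
ac.add(["congestive", "heart", "failure"], "C_CHF")
ac.add(["aspirin"], "C_ASPIRIN")
ac.build()
matches = ac.search(["aspirin", "for", "congestive", "heart", "failure"])
```

The failure links let the matcher report overlapping terms, such as "heart failure" inside "congestive heart failure", without rescanning the query.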

Fourth, in the concept screening step, we select the UMLS concepts in the query that are members of the treatment semantic group (Drug, and Therapeutic or Preventive Procedure) or the disorder semantic group (Abnormality, Dysfunction, Disease or Syndrome, Finding, Injury or Poisoning, Pathologic Function, and Sign or Symptom).
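The screening step amounts to a set-membership filter, sketched below. The type names are copied from the text as written and may not match official UMLS semantic type labels exactly; the concept record layout (`cui`, `semantic_type`) and the identifiers are hypothetical.

```python
# Semantic type names copied from the text above (not verified against UMLS).
TREATMENT_TYPES = {"Drug", "Therapeutic or Preventive Procedure"}
DISORDER_TYPES = {
    "Abnormality",
    "Dysfunction",
    "Disease or Syndrome",
    "Finding",
    "Injury or Poisoning",
    "Pathologic Function",
    "Sign or Symptom",
}


def screen_concepts(concepts):
    """Keep concepts in the treatment or disorder groups, labeled by group.

    Each concept is a dict with 'cui' and 'semantic_type' keys
    (a hypothetical record layout used only for this sketch).
    """
    kept = []
    for concept in concepts:
        sem_type = concept["semantic_type"]
        if sem_type in TREATMENT_TYPES:
            kept.append((concept["cui"], "treatment"))
        elif sem_type in DISORDER_TYPES:
            kept.append((concept["cui"], "disorder"))
    return kept


candidates = [
    {"cui": "C_ASPIRIN", "semantic_type": "Drug"},
    {"cui": "C_HF", "semantic_type": "Disease or Syndrome"},
    {"cui": "C_HOSPITAL", "semantic_type": "Health Care Related Organization"},
]
screened = screen_concepts(candidates)
```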