A text processing pipeline to extract recommendations from radiology reports
by Yetisgen-Yildiz et al.

Natural Language Processing

Summary of article


A text processing approach based on natural language processing (NLP) and machine learning to identify sentences that contain clinically important recommendation information in radiology reports.


Existing methods


MedLEE
  • Extracts and structures clinical information from radiology report text
  • Translates clinical information to terms in a controlled vocabulary
  • Makes the structured information accessible to further automated procedures

SAPPHIRE
  • Matches text to concepts in the Unified Medical Language System (UMLS) Metathesaurus
  • Indexes radiology reports automatically
  • Develops clinical image repositories that can be used for patient care and medical education
  • Note: One of the precursors to MetaMap

Lexicon Mediated Entropy Reduction (LEXIMER)
  • Identifies reports that include clinically important findings and recommendations for subsequent action
  • Black box approach


Pipeline


Paper pipeline

Disclaimer: I will use the word "feature" over and over in the explanations below; this just means "variable" in bioinformatics speak.


Section segmentation

  • Divides radiology report into 11 main sections
  • The classifier operates at the line level rather than the sentence level, since the content of clinical records tends to be fragmentary and list-based.

Pre-processing step: Create a list of section categories by randomly selecting a small subset of reports

  1. Construct ontology of section categories

  2. Randomly select reports from corpus

  3. Annotate section boundaries

  4. Assign each section a category from ontology

  5. Group sections:

    • Similar sections together
    • Rare sections in catch-all category

First step (after pre-processing): Section segmentation

  1. Using its text and tag features, label each line with one of:
    • B (beginning of a section)
    • I (inside a section), or
    • O (outside any section)

  2. The segmented but still unlabeled sections are passed on to the second step (section classification), where each is assigned a section category according to its features. A toy sketch of the text features follows Table 1.

Table 1: Section Segmentation features
  Text features: isAllCaps, isTitleCaps, containsNumber, beginsWithNumber, numTokens, numPreBlanklines, numPostBlanklines, firstToken, secondToken, unigram
  Tag features: prevTag, prevTwoTags, tagChainLength
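
Below is a minimal Python sketch of how the per-line text features in Table 1 could be computed. This is illustrative only: the paper's pipeline is built on the MALLET toolkit (Java), and the helper function here (line_text_features) is hypothetical.

  def line_text_features(lines, idx):
      """Table 1 text features for lines[idx] (hypothetical helper)."""
      line = lines[idx]
      tokens = line.split()
      # Count blank lines immediately before and after this line.
      pre = 0
      j = idx - 1
      while j >= 0 and not lines[j].strip():
          pre += 1
          j -= 1
      post = 0
      j = idx + 1
      while j < len(lines) and not lines[j].strip():
          post += 1
          j += 1
      return {
          "isAllCaps": line.isupper(),
          "isTitleCaps": line.istitle(),
          "containsNumber": any(ch.isdigit() for ch in line),
          "beginsWithNumber": line.strip()[:1].isdigit(),
          "numTokens": len(tokens),
          "numPreBlanklines": pre,
          "numPostBlanklines": post,
          "firstToken": tokens[0] if tokens else "",
          "secondToken": tokens[1] if len(tokens) > 1 else "",
      }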

Second step (after pre-processing): Section classification

  1. Label each section with a section category using the Header, Body, and Tag features

  2. Methods used (a toy MaxEnt sketch follows Table 2):
    • Classification: Maximum Entropy (MaxEnt) models
    • Finding a good tag sequence: beam search
    • L-BFGS parameter estimation and Gaussian prior smoothing: MALLET toolkit

  3. Measure overall performance with 5-fold cross validation

Table 2: Section Classification features
  Header features: same as the Text features in Table 1, computed on the header line only
  Body features: avgLineLength, numLines, docPosition, containsList, unigram
  Tag features: prevTag, tagHistUnigram, tagChainLength
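
A MaxEnt classifier is equivalent to multinomial logistic regression, so a rough Python analogue of the classification step can be sketched with scikit-learn. The paper itself uses MALLET in Java, and the feature values below are made up; the L2 penalty corresponds to the Gaussian prior, and solver="lbfgs" mirrors the L-BFGS parameter estimation.

  from sklearn.feature_extraction import DictVectorizer
  from sklearn.linear_model import LogisticRegression

  # Made-up per-section feature dicts with section-category labels.
  X_dicts = [
      {"header:FINDINGS": 1, "numLines": 4, "docPosition": 0.6},
      {"header:IMPRESSION": 1, "numLines": 2, "docPosition": 0.9},
  ]
  y = ["findings", "impression"]

  vec = DictVectorizer()
  X = vec.fit_transform(X_dicts)

  # Logistic regression == MaxEnt; L2 penalty plays the Gaussian-prior role.
  clf = LogisticRegression(solver="lbfgs", penalty="l2", C=1.0)
  clf.fit(X, y)
  print(clf.predict(vec.transform([{"header:FINDINGS": 1, "numLines": 3}])))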


Sentence segmentation

  • Goal: split the text into individual sentences so that each one can later be labeled as a positive or negative recommendation sentence
  • Identify sentence boundaries in section bodies using the OpenNLP sentence chunker (a rough Python stand-in follows)
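
The paper uses OpenNLP (a Java library); as a rough stand-in, here is the same idea with NLTK's punkt sentence tokenizer in Python. The example text is invented.

  import nltk

  nltk.download("punkt", quiet=True)      # tokenizer models (newer NLTK
  nltk.download("punkt_tab", quiet=True)  # versions use punkt_tab instead)

  body = ("No acute cardiopulmonary disease. Recommend follow-up chest CT "
          "in 3 months to document resolution.")
  for sentence in nltk.sent_tokenize(body):
      print(sentence)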


Recommendation Extraction


Goal: label each sentence as a positive or negative recommendation sentence

Creating feature vector

A feature vector is created based on the characteristics and content of the sentences in the report, and includes:
Table 3: Feature Vector
  Category         Feature type            Type of variable  Dimension
  Baseline (B)     unigram                 string            num of words
  Ngram (N)        bigram                  string            num of bigrams
  Ngram (N)        trigram                 string            num of trigrams
  Syntactic (S)    tense                   categorical       num of tenses
  Syntactic (S)    stemmedVerb             string            num of unique stemmed verbs
  Syntactic (S)    includesModalVerb       binary            num of sentences
  Syntactic (S)    includesTemporalPhrase  binary            num of sentences
  Knowledge (K)    UMLSConcept             binary            num of UMLS concepts
  Structural (St)  sectionType             binary            num of section types

Explanation of variables:

  1. Baseline: Unigrams, single words.

  2. Ngram: Bigrams & trigrams, i.e. sequences of two and three words.

  3. Syntactic: Part-of-speech tags obtained by passing the sentence tokens through the Stanford POS tagger. The following features (variables) are then created:
    • tense: Tense of the sentence (past, present, future, past participle, etc.)
    • stemmedVerb: The verb with its tense stripped away, leaving only the central meaning. Done by passing the verbs through the Porter stemmer
    • includesModalVerb: Whether the sentence contains a modal verb (e.g. "might", "may", "shall")
    • includesTemporalPhrase: Whether the sentence contains a temporal phrase (e.g. "for 15 minutes", "December 2012")

  4. Knowledge-based: UMLS concepts (medical terms) are extracted from the report text with MetaMap (a tool created by the NLM), which maps strings in free text to biomedical concepts in the UMLS database. Synonymous concepts are grouped together under a Concept Unique Identifier (CUI).
    • UMLSConcept: Matrix of indicators of UMLS concepts

  5. Structural: Each of the 11 sections is labeled, and every line is matched to a section. A small sketch of assembling such a feature vector follows this list.
    • sectionType: Matrix of indicators of whether the line is in a particular section (e.g. background, findings)
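
Below is a minimal sketch of how a per-sentence feature vector might be assembled in Python with scikit-learn. The n-gram part follows the paper; the modal-verb and temporal-phrase detectors here are crude keyword stand-ins for the Stanford POS tagger and a real temporal-phrase recognizer, and the example sentences are invented.

  from sklearn.feature_extraction.text import CountVectorizer
  from scipy.sparse import hstack, csr_matrix

  sentences = [
      "Recommend follow-up CT in 3 months.",  # positive example
      "The lungs are clear.",                 # negative example
  ]

  # Baseline + Ngram features: unigrams, bigrams, and trigrams.
  ngrams = CountVectorizer(ngram_range=(1, 3), binary=True)
  X_ngram = ngrams.fit_transform(sentences)

  # Crude keyword stand-ins for two Syntactic flags from Table 3.
  MODALS = {"may", "might", "shall", "should", "could"}
  TEMPORAL = {"minutes", "days", "weeks", "months", "years"}

  def syntactic_flags(sentence):
      tokens = sentence.lower().replace(".", "").split()
      return [
          int(any(t in MODALS for t in tokens)),    # includesModalVerb
          int(any(t in TEMPORAL for t in tokens)),  # includesTemporalPhrase
      ]

  X_syn = csr_matrix([syntactic_flags(s) for s in sentences])
  X = hstack([X_ngram, X_syn])  # sparse feature vector per sentence
  print(X.shape)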


Classifier training


Model Selection
Feature set size and type selection (subset selection).

Problem: The full feature space is huge, so the design matrix is very sparse.
Goal: Select the most predictive variables to incorporate in the final model.
Steps:
  1. Encode the distinct features to create a huge design matrix.
  2. Stepwise variable selection to choose the number of variables (N) for the baseline (unigrams).
  3. Stepwise variable selection to choose the combination of feature categories.
  4. Rank feature sets by F-score using 5-fold cross validation.
Results: Unigram (N=200) has the highest F-score, so it is used as the baseline. Baseline+Syntactic and Baseline+Structural both give slightly better predictive performance. A sketch of cross-validated feature selection follows.
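
This is a hedged sketch of the selection idea, not the paper's exact stepwise procedure: a chi-squared SelectKBest filter stands in for the unigram ranking, and X, y are assumed to be a sparse feature matrix and 0/1 labels built as in the earlier sketch (over a much larger labeled corpus).

  from sklearn.feature_selection import SelectKBest, chi2
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import make_pipeline

  # Sweep the feature-set size N; k must not exceed the number of columns in X.
  for n in (50, 100, 200, 400):
      model = make_pipeline(
          SelectKBest(chi2, k=n),
          LogisticRegression(solver="lbfgs"),
      )
      f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
      print(f"N={n}: mean F-score {f1:.3f}")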

Data imbalance experiments

Problem: The proportion of positive to negative sentences is very low (1:165).
Goal: Select a ratio of negative to positive sentences that gives the optimal recall, precision & F-score.
Steps:
  1. Create a feature vector for each sentence.
  2. Train 165 classifiers, each with a ratio of i:1 negative:positive sentences, for i = 1 to 165 (an undersampling sketch follows).
    For each training fold, all positive sentences are used, while negative sentences are randomly selected to give the ratio i:1.
    For the test fold, all negative sentences are included (all 165 experiments share the same test set).
Results: When the sets are balanced, recall is high and precision is low. As i increases, recall decreases and precision slowly increases. i = 44 gives the most balanced precision & recall, while i = 9 gives the highest F-score.
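
A minimal sketch of the undersampling step, assuming positives and negatives are lists of already-featurized sentences (the function name is hypothetical):

  import random

  def make_training_set(positives, negatives, i, seed=0):
      """Undersample negatives to an i:1 negative:positive ratio."""
      rng = random.Random(seed)
      k = min(len(negatives), i * len(positives))
      return positives + rng.sample(negatives, k)

  # Sweep i = 1..165, train one classifier per training set, and
  # evaluate every model on the same full test fold.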


Error Analysis
  • 5-fold cross validation (because there are few positive sentences)
  • Criteria used include: false positives, false negatives, precision, recall, and F-score (computed as sketched below)
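
For reference, a small sketch of how these criteria are computed from confusion counts (the numbers here are invented, not results from the paper):

  def precision_recall_f(tp, fp, fn):
      """Precision, recall, and F-score from confusion counts."""
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      f = (2 * precision * recall / (precision + recall)
           if precision + recall else 0.0)
      return precision, recall, f

  print(precision_recall_f(tp=30, fp=10, fn=20))  # (0.75, 0.6, 0.666...)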