Training a new named entity recogniser model
Pre-processing
Pre-processing functions can be found in the dose_instruction_parser.di_prepare module.
Several operations are performed on the input text to get it ready for
the NER model:
All parentheses are replaced with a blank space
Hyphens (“-”) and slashes (“/” and “\”) are replaced with a blank space
Certain keywords are replaced with alternatives, e.g. “qad” -> “every other day”. These combinations are listed in dose_instruction_parser.data.replace_words
Spelling is corrected using the pyspellchecker package. Certain keywords are not corrected. These are listed in dose_instruction_parser.data.keep_words
Number-words are converted to numbers using the word2number package, e.g. “two” -> “2”; “half” -> “0.5”.
Blank spaces are added around numbers e.g. “2x30ml” -> “ 2 x 30 ml”
Extra spaces between words are removed
Leading and trailing whitespace is removed
For example, pre-processing would have the following results:
Input | Output |
---|---|
take two tabs MORNING and nghit | take 2 tablets morning and night |
half cap qh | 0.5 capsule every hour |
two puff(s) | 2 puff |
one/two with meals | 1 / 2 with meals |
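The listed steps can be sketched in a few lines of Python. This is a simplified illustration, not the code in dose_instruction_parser.di_prepare: the function name `preprocess_sketch` and the small `REPLACE_WORDS` and `NUMBER_WORDS` dictionaries are hypothetical stand-ins for the package's word lists and for word2number, the lowercasing step is an assumption inferred from the example table, and spelling correction is omitted, so it will not reproduce every row of the table.

```python
import re

# Hypothetical stand-ins for dose_instruction_parser.data.replace_words and
# for the word2number package; the real lists are assumed to be much longer.
REPLACE_WORDS = {
    "qad": "every other day",
    "qh": "every hour",
    "tabs": "tablets",
    "cap": "capsule",
}
NUMBER_WORDS = {"half": "0.5", "one": "1", "two": "2", "three": "3"}


def preprocess_sketch(text: str) -> str:
    """Apply a simplified version of the listed pre-processing steps."""
    text = text.lower()  # assumption: inferred from "MORNING" -> "morning"
    text = re.sub(r"[()]", " ", text)    # parentheses -> blank space
    text = re.sub(r"[-/\\]", " ", text)  # hyphens and slashes -> blank space
    words = [REPLACE_WORDS.get(word, word) for word in text.split()]
    words = [NUMBER_WORDS.get(word, word) for word in words]
    text = " ".join(words)
    # Add blank spaces around numbers, e.g. "2x30ml" -> " 2 x 30 ml"
    text = re.sub(r"(\d+(?:\.\d+)?)", r" \1 ", text)
    text = re.sub(r"\s+", " ", text)     # remove extra spaces between words
    return text.strip()                  # remove leading/trailing whitespace


print(preprocess_sketch("half cap qh"))
print(preprocess_sketch("2x30ml"))
```

Keeping each step as a single substitution makes the order of operations explicit, which matters here: number-words must be converted before spaces are added around numbers.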
Named Entity Recognition (NER)
The next step in processing is to identify parts of the instruction associated with each named entity. This is done using a neural network, which is a type of machine learning model. The neural network is implemented via the spacy package.
At Public Health Scotland we have trained a model called en_edris9 to do NER. Due to data protection concerns the model is not currently publicly available. Please contact the eDRIS team on phs.edris@phs.scot to enquire about access.
In the en_edris9 model there are nine named entities:
DOSAGE
FORM
ROUTE
DRUG
STRENGTH
FREQUENCY
DURATION
AS_REQUIRED
AS_DIRECTED
en_edris9 was based on the Med7 model (en_med7), with the addition of two entities: “AS_REQUIRED” and “AS_DIRECTED”.
Preparing for training
The code used to prepare and train the model can be found in the model folder. To generate en_edris9, the en_med7 model was further trained on approximately 7,000 gold-standard tagged dose instructions. Each instruction was separately tagged by two eDRIS analysts, and any tagged instructions which didn’t match identically were manually resolved by the team. This was to ensure high quality input data.
There are three steps to prepare raw examples of dose instructions for training:
Tag entities by hand
Optional: cross check tagging if each example has been tagged twice by different people
Convert tagged examples to .spacy format
Tag entities by hand
The tagging process was carried out using the desktop version of NER Annotator for Spacy, which outputs tagged dose instructions in .json format. An example of a .json file with just two tagged dose instructions is:
{"classes":["DOSE","FORM","FREQUENCY","DURATION","ROUTE","DRUG","STRENGTH","AS_DIRECTED","AS_REQUIRED"],
"annotations":[
["1 tab in the morning",{"entities":[[0,1,"DOSE"],[2,5,"FORM"],[6,20,"FREQUENCY"]]}],
["1 cap 4 times daily",{"entities":[[0,1,"DOSE"],[2,5,"FORM"],[6,19,"FREQUENCY"]]}]
]
}
The numbers alongside each entity refer to the start and end positions of the entity in the text, counting characters from 0. For example, for “1 tab in the morning”, the FORM entity is recorded as [2,5,"FORM"], which corresponds to the characters from position 2 up to but not including position 5, i.e. “tab”.
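As a quick check of how the offsets work, the sketch below loads the example annotations with Python's standard json module and slices each instruction by its recorded start and end positions. The helper name `entity_texts` is hypothetical and is not part of the repository.

```python
import json

# The two-instruction example from above, stored as a string for illustration.
raw = """
{"classes":["DOSE","FORM","FREQUENCY","DURATION","ROUTE","DRUG","STRENGTH","AS_DIRECTED","AS_REQUIRED"],
"annotations":[
["1 tab in the morning",{"entities":[[0,1,"DOSE"],[2,5,"FORM"],[6,20,"FREQUENCY"]]}],
["1 cap 4 times daily",{"entities":[[0,1,"DOSE"],[2,5,"FORM"],[6,19,"FREQUENCY"]]}]
]
}
"""


def entity_texts(annotation):
    """Return (label, text) pairs for one [text, {"entities": [...]}] item."""
    text, meta = annotation
    # Python slices use the same convention: start included, end excluded.
    return [(label, text[start:end]) for start, end, label in meta["entities"]]


data = json.loads(raw)
for annotation in data["annotations"]:
    print(annotation[0], "->", entity_texts(annotation))
```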
Cross check tagging
Cross-checking tagging is done using the 1-json_to_dat.py script in model/preprocess/. Tagged examples must first be copied to model/preprocess/tagged/, and must be in .json format. The script outputs two .dat files to model/preprocess/processed:
crosschecked_data_{time}.dat contains examples where both taggers agree on all tags
conflicting_data_{time}.dat contains examples where taggers disagree on one or more tags
You must then manually open conflicting_data_{time}.dat and resolve any conflicts, before saving it as resolved_data_{time}.dat.
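The idea behind cross-checking can be illustrated with a short sketch. This is not the logic of 1-json_to_dat.py, whose internals and .dat layout are not shown here. It assumes each tagger's work has already been loaded into a dictionary mapping instruction text to a list of [start, end, label] entities, and the function name `cross_check` is hypothetical.

```python
def cross_check(tagger_a, tagger_b):
    """Split doubly-tagged examples into agreed and conflicting sets.

    Each argument maps instruction text -> list of [start, end, label].
    Only instructions tagged by both taggers are compared.
    """
    agreed, conflicting = {}, {}
    for text in tagger_a.keys() & tagger_b.keys():
        # Sort so that the order in which entities were tagged is ignored.
        ents_a = sorted(tuple(ent) for ent in tagger_a[text])
        ents_b = sorted(tuple(ent) for ent in tagger_b[text])
        if ents_a == ents_b:
            agreed[text] = ents_a
        else:
            conflicting[text] = {"a": ents_a, "b": ents_b}
    return agreed, conflicting


first = {
    "1 tab in the morning": [[0, 1, "DOSE"], [2, 5, "FORM"], [6, 20, "FREQUENCY"]],
    "1 cap 4 times daily": [[0, 1, "DOSE"], [2, 5, "FORM"], [6, 19, "FREQUENCY"]],
}
second = {
    "1 tab in the morning": [[2, 5, "FORM"], [0, 1, "DOSE"], [6, 20, "FREQUENCY"]],
    "1 cap 4 times daily": [[0, 1, "DOSE"], [2, 5, "FORM"], [8, 19, "FREQUENCY"]],
}
agreed, conflicting = cross_check(first, second)
print(sorted(agreed), sorted(conflicting))
```

An example counts as agreed only if every entity matches exactly on start, end and label, which mirrors the requirement that taggers agree on all tags.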
Convert tagged examples to .spacy format
The crosschecked and resolved data are converted to .spacy format by 2-dat_to_spacy.py.
The instances are shuffled and split into train, test and dev data with an 8:1:1 split. This can be changed by editing the file. Data are saved to model/data in .spacy format.
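A shuffle followed by an 8:1:1 split might look like the sketch below. It is an illustration only and does not reproduce 2-dat_to_spacy.py: the function name `split_8_1_1` and the `seed` parameter are hypothetical, and writing the resulting sets to .spacy files is left out.

```python
import random


def split_8_1_1(examples, seed=0):
    """Shuffle examples and split them into train, test and dev lists."""
    items = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_test = len(items) // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    dev = items[n_train + n_test:]  # whatever remains, roughly 10%
    return train, test, dev


train, test, dev = split_8_1_1(range(100))
print(len(train), len(test), len(dev))
```

Integer arithmetic is used for the cut points so that the three parts always add up to the full data set, with any remainder going to the dev set.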
Training
Before training the model you need to define a DI_FILEPATH environment variable, which is the file path that models will be saved to and loaded from. You should save this variable in a secrets.env file in the dose_instructions_ner folder. The contents of secrets.env should be:
export DI_FILEPATH="/path/to/folder/"
You can train the model by opening a Terminal and running:
cd model
./train_model.sh
You will be taken through interactive steps in the Terminal to set the model name.
The model parameters are defined in model/config/config.cfg, which is a spacy configuration file. There are a few important things to note about the contents:
The path to the training data is set under [paths]
The en_med7 model is used as a starting point for training. This is set using the source parameters under [components] and also in [initialize.before_init].
The hyperparameters for the neural network are set under [training.optimizer]. The Adam optimiser is the default.
[training.score_weights] details the relative importance of different measures in evaluating training performance. Available measures are precision, recall and F-score (the harmonic mean of precision and recall).
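For reference, the three measures can be computed from counts of correct and incorrect entity predictions as in the sketch below. The function name `precision_recall_f` and the example counts are made up for illustration; during training these values are calculated by spacy itself.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-score) from entity-level counts."""
    predicted = true_positives + false_positives  # entities the model tagged
    actual = true_positives + false_negatives     # entities in the gold data
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical counts: 8 correct entities, 2 spurious, 8 missed.
p, r, f = precision_recall_f(8, 2, 8)
print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```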
Model training logs will be saved to a logs folder within your DI_FILEPATH.
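If you want to locate the logs from Python, a small helper along these lines could be used. It is a sketch under the assumption that DI_FILEPATH has already been exported in the current environment (for example by sourcing secrets.env); the function name `logs_dir` is hypothetical and not part of the repository.

```python
import os
from pathlib import Path


def logs_dir(environ=None):
    """Return the logs folder inside DI_FILEPATH, or raise if it is unset."""
    environ = os.environ if environ is None else environ
    base = environ.get("DI_FILEPATH")
    if not base:
        raise RuntimeError("DI_FILEPATH is not set; define it in secrets.env first")
    return Path(base) / "logs"


# Passing a mapping explicitly avoids depending on the real environment.
print(logs_dir({"DI_FILEPATH": "/path/to/folder/"}))
```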
Model performance
You can evaluate the performance of a model by running the evaluate_model.sh script in a Terminal from within the model folder. You can provide either the name of the model you wish to evaluate or its location:
./evaluate_model.sh
This will produce a log in the logs folder within DI_FILEPATH.
Adapting the model or training your own
You can adapt the model by training it again using additional training examples. To do this you need to install the en_edris9 model and amend the configuration file so that the starting model is en_edris9 rather than en_med7.
To train your own model you can follow similar steps, starting from any of en_med7, en_edris9 or a standard language model like en_core_web_sm. Refer to spacy documentation for more information.