.. _Training a model:

Training a new named entity recogniser model
============================================

Pre-processing
~~~~~~~~~~~~~~

Pre-processing functions can be found in the :mod:`dose_instruction_parser.di_prepare` module.
Several operations are performed on the input text to get it ready for the NER model:

#. All parentheses are replaced with a blank space
#. Hyphens ("-") and slashes ("/" and "\\") are replaced with a blank space
#. Certain keywords are replaced with alternatives, e.g. "qad" -> "every other day". These combinations are listed in :mod:`dose_instruction_parser.data.replace_words`
#. Spelling is corrected using the `pyspellchecker <https://pypi.org/project/pyspellchecker/>`_ package. Certain keywords are not corrected; these are listed in :mod:`dose_instruction_parser.data.keep_words`
#. Number-words are converted to numbers using the `word2number <https://pypi.org/project/word2number/>`_ package, e.g. "two" -> "2"; "half" -> "0.5"
#. Blank spaces are added around numbers, e.g. "2x30ml" -> " 2 x 30 ml"
#. Extra spaces between words are removed
#. Leading and trailing whitespace is removed

For example, pre-processing would have the following results:

===============================  ================================
Input                            Output
===============================  ================================
take two tabs MORNING and nghit  take 2 tablets morning and night
half cap qh                      0.5 capsule every hour
two puff(s)                      2 puff
one/two with meals               1 / 2 with meals
===============================  ================================

Named Entity Recognition (NER)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The next step in processing is to identify the parts of the instruction associated with each
named entity. This is done using a neural network (a type of machine learning model),
implemented via the `spacy <https://spacy.io/>`_ package. At Public Health Scotland we have
trained a model called :program:`en_edris9` to do NER. Due to data protection concerns the
model is not currently publicly available. Please contact the eDRIS team at
`phs.edris@phs.scot <mailto:phs.edris@phs.scot>`_ to enquire about access.

In the :program:`en_edris9` model there are nine named entities:

* DOSAGE
* FORM
* ROUTE
* DRUG
* STRENGTH
* FREQUENCY
* DURATION
* AS_REQUIRED
* AS_DIRECTED

:program:`en_edris9` was based on the `Med7 <https://github.com/kormilitzin/med7>`_ model with
the addition of two entities: "AS_REQUIRED" and "AS_DIRECTED".

Preparing for training
^^^^^^^^^^^^^^^^^^^^^^

The code used to prepare and train the model can be found in the **model** folder. To generate
:program:`en_edris9`, the :program:`en_med7` model was further trained on approximately 7,000
gold-standard tagged dose instructions. Each instruction was tagged separately by two eDRIS
analysts, and any tagged instructions which did not match identically were manually resolved by
the team. This was to ensure high-quality input data.

There are three steps to prepare raw examples of dose instructions for training:

#. Tag entities by hand
#. Optional: cross-check tagging if each example has been tagged twice by different people
#. Convert tagged examples to :file:`.spacy` format

Tag entities by hand
''''''''''''''''''''

The tagging process was carried out using the desktop version of
`NER Annotator for Spacy <https://github.com/tecoholic/ner-annotator>`_, which outputs tagged
dose instructions in :file:`.json` format.
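Once you have exported a tagged file, it is worth checking that the recorded character offsets
cover exactly the words you meant to tag; the file format itself is shown in the example below
this snippet. The following is only a sketch: the file name is hypothetical, and it assumes the
export structure produced by NER Annotator.

.. code::

    import json

    # Hypothetical path to a NER Annotator export -- replace with your own file
    with open("model/preprocess/tagged/example.json") as f:
        tagged = json.load(f)

    # Each annotation is a [text, {"entities": [[start, end, label], ...]}] pair
    for text, annotation in tagged["annotations"]:
        for start, end, label in annotation["entities"]:
            # Print the label alongside the exact characters it covers
            print(f"{label:<12} {text[start:end]!r}")

If a printed span cuts a word in half or covers the wrong text, correct the offsets for that
example before moving on.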
An example of a :file:`.json` file with just two tagged dose instructions is:

.. code::

    {"classes":["DOSE","FORM","FREQUENCY","DURATION","ROUTE","DRUG","STRENGTH","AS_DIRECTED","AS_REQUIRED"],
     "annotations":[
        ["1 tab in the morning",{"entities":[[0,1,"DOSE"],[2,5,"FORM"],[6,20,"FREQUENCY"]]}],
        ["1 cap 4 times daily",{"entities":[[0,1,"DOSE"],[2,5,"FORM"],[6,19,"FREQUENCY"]]}]
     ]
    }

The numbers alongside each entity refer to the start and end positions of the entity in the
text (counting from 0). For example, for "1 tab in the morning" the **FORM** entity is recorded
as :program:`[2,5,"FORM"]`, which corresponds to characters 2 up to (but not including) 5 in
the text, i.e. "tab".

Cross check tagging
'''''''''''''''''''

Cross-checking tagging is done using the :file:`1-json_to_dat.py` script in
:file:`model/preprocess/`. Tagged examples must first be copied to
:file:`model/preprocess/tagged/`, and must be in :file:`.json` format. The script outputs two
:file:`.dat` files to :file:`model/preprocess/processed`:

* :file:`crosschecked_data_\\{time\\}.dat` contains examples where both taggers agree on all tags
* :file:`conflicting_data_\\{time\\}.dat` contains examples where the taggers disagree on one or more tags

You must then manually open :file:`conflicting_data_\\{time\\}.dat` and resolve any conflicts,
before saving out as :file:`resolved_data_\\{time\\}.dat`.

Convert tagged examples to .spacy format
''''''''''''''''''''''''''''''''''''''''

The crosschecked and resolved data are converted to :file:`.spacy` format by
:file:`2-dat_to_spacy.py`. The instances are shuffled and split into train, test and dev sets
with an 8:1:1 ratio. This can be changed by editing the file. Data are saved out to
:file:`model/data` in :file:`.spacy` format.

Training
^^^^^^^^

Before training the model you need to define a :program:`DI_FILEPATH` environment variable,
which is the file path you will save and load models from. You should save this variable in a
:file:`secrets.env` file in the :file:`dose_instructions_ner` folder. The contents of
:file:`secrets.env` should be:

.. code::

    export DI_FILEPATH="/path/to/folder/"

You can train the model by opening a Terminal and running:

.. code::

    cd model
    ./train_model.sh

You will be taken through interactive steps in the Terminal to set the model name. The model
parameters are defined in :file:`model/config/config.cfg`, which is a
`spacy configuration file <https://spacy.io/usage/training#config>`_. There are a few important
things to note about the contents:

* The path to the training data is set under **\[paths\]**
* The :program:`en_med7` model is used as a starting point for training. This is set using the **source** parameters under **\[components\]** and also in **\[initialize.before_init\]**.
* The hyperparameters for the neural network are set under **\[training.optimizer\]**. The `Adam <https://arxiv.org/abs/1412.6980>`_ optimiser is the default.
* **\[training.score_weights\]** details the relative importance of the different measures used to evaluate training performance. Available measures are precision, recall and the `F-score <https://en.wikipedia.org/wiki/F-score>`_ (the harmonic mean of precision and recall).

Model training logs will be saved to a :file:`logs` folder within your :program:`DI_FILEPATH`.

Model performance
^^^^^^^^^^^^^^^^^

You can evaluate the performance of a model by running the :file:`evaluate_model.sh` script in a
Terminal from within the :file:`model` folder. You can either provide the name of the model you
wish to evaluate or the location of the model files.

.. code::

    ./evaluate_model.sh

This will produce a log in the :file:`logs` folder within :program:`DI_FILEPATH`.
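Alongside the evaluation script, you can also load a trained model directly with spacy and
inspect the entities it predicts for a pre-processed instruction. This is only a quick,
informal check; the path below is hypothetical and should point at a model saved under your
:program:`DI_FILEPATH`.

.. code::

    import spacy

    # Hypothetical location of a trained pipeline saved under DI_FILEPATH
    nlp = spacy.load("/path/to/folder/en_edris9")

    doc = nlp("take 2 tablets morning and night")
    for ent in doc.ents:
        # Prints each predicted entity span and its label (e.g. FORM, FREQUENCY)
        print(ent.text, ent.label_)

For a formal assessment you should still rely on the precision, recall and F-score reported in
the evaluation log.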
Adapting the model or training your own
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can adapt the model by training it again using additional training examples. To do this you
need to install the :program:`en_edris9` model and amend the configuration file so that the
starting model is :program:`en_edris9` rather than :program:`en_med7`. To train your own model
you can follow similar steps, starting from any of :program:`en_med7`, :program:`en_edris9` or a
standard language model such as :program:`en_core_web_sm`. Refer to the
`spacy <https://spacy.io/usage/training>`_ documentation for more information.
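Any additional training examples need to reach :file:`.spacy` format, just like the original
training data. If you are not going via the :file:`model/preprocess` scripts, the sketch below
shows one way to build a :file:`.spacy` file directly with spacy's :program:`DocBin`; the
example data and output path are hypothetical.

.. code::

    import spacy
    from spacy.tokens import DocBin

    # Hypothetical extra examples, in the same (text, entities) form as the tagged .json files
    extra_examples = [
        ("1 tab in the morning", [(0, 1, "DOSE"), (2, 5, "FORM"), (6, 20, "FREQUENCY")]),
    ]

    nlp = spacy.blank("en")  # tokenizer only; no trained components are needed here
    db = DocBin()

    for text, entities in extra_examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in entities:
            # char_span returns None if the offsets do not align with token boundaries
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans
        db.add(doc)

    # Hypothetical output path -- point the training data paths in config.cfg at files like this
    db.to_disk("model/data/extra_train.spacy")

The resulting file can then be included in the training data referenced under **\[paths\]** in
:file:`config.cfg` before re-running :file:`train_model.sh`.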