Named Entity Recognition (NER) using Keras LSTM & spaCy
How can we extract useful information from massive collections of unstructured documents? This question had been around for a long time before named entity recognition (NER) models came along. NER helps people extract key information across many different industries. This article introduces the methods used to solve the NER problem and walks through the code to build and train a bidirectional LSTM with Keras. On top of that, we also demonstrate a NER model using spaCy.
What is Named Entity Recognition?
NER seeks to extract and classify words into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and so on. NER is used in natural language processing (NLP) to help answer real-world questions: it can recognize and parse important information from resumes, search for specific products mentioned in complaints or reviews, look for a company name in a news article, and much more. Apart from being an information extraction tool, it is also a preprocessing step for many NLP applications like machine translation, question answering, and text summarization. Now, let's take a look at the different deep learning approaches to solving the NER problem.
Example 1: Named Entity Recognition (NER) using LSTMs with Keras
Deep Learning Approaches for Sequential Data:
1. RNN
The recurrent neural network (RNN) is a well-studied architecture for processing variable-length input while allowing past information to persist, and it is commonly used for NLP. Information travels through the network sequentially: the words are first transformed into machine-readable vectors, and the RNN then processes the sequence of vectors one by one.
While processing, a chunk of the neural network takes some input x and outputs a value h. It then passes the hidden state to the next step of the sequence. The hidden state acts as the network's memory of the previous steps.
Looking at a simple cell in RNN, the input and previous hidden state are combined to form a vector. Then the vector goes through the tanh activation, and the output is the new hidden state or the memory of the network. The tanh activation is used to help regulate the values flowing through the network. It squishes values between -1 and 1.
Because values keep being squashed between -1 and 1 and multiplied by small numbers at every step, past information can easily vanish, which causes the long-term dependency problem. As a result, the influence of an earlier state on the current prediction is lost as it propagates over time.
2. LSTM
Long Short Term Memory networks are a special kind of RNN capable of learning long-term dependencies. The network is designed to avoid the vanishing gradient problem. An LSTM unit has different gates to learn which data in a sequence is important to keep or throw away.
Forget Gate: As information travels through each chunk of the network, the LSTM first decides what information to throw away using the forget gate. Values passed through the sigmoid function come out between 0 and 1: the closer to 0, the more is forgotten; the closer to 1, the more is kept.
Input Gate: The next step is to decide what new information we are going to store in the cell state, using the input gate to update values. We also pass the hidden state and current input into the sigmoid and the tanh function to squish values to help regulate the network. Then we multiply the tanh output with the sigmoid output. The sigmoid output will decide which information is important to keep from the tanh output.
Cell State: The cell state is first multiplied pointwise by the forget vector, which can drop values in the cell state when they are multiplied by numbers near 0. We then do a pointwise addition with the output of the input gate, which updates the cell state with the new values the network finds relevant. This gives us our new cell state.
Output Gate: Finally, we decide what we are going to output based on the filtered information. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state to the tanh function. We multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The output is the new cell and hidden state, carried over to the next time step.
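To make the gate descriptions above concrete, here is a minimal NumPy sketch of a single LSTM step; the weight layout and variable names are our own illustration, not Keras internals:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenated [h_prev, x_t] to the four gate pre-activations
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)             # forget gate: what to erase from the cell state
    i = sigmoid(i)             # input gate: which candidate values to write
    g = np.tanh(g)             # candidate values
    o = sigmoid(o)             # output gate: what part of the cell to expose
    c_t = f * c_prev + i * g   # new cell state
    h_t = o * np.tanh(c_t)     # new hidden state
    return h_t, c_t
```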
3. Bidirectional LSTM
A bidirectional LSTM is a combination of two LSTMs: one runs forward over the input from left to right, and the other runs backward from right to left. This gives the network additional context and can result in faster and fuller learning, improving model performance on sequence classification problems. We will use a bidirectional LSTM with Keras to solve the NER problem.
Building a Bi-LSTM Model with Keras
The full code is available in our GitHub repository.
Step 1: Set Up the Packages:
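The exact setup lives in the notebook, but a minimal sketch of the packages used throughout this example could look like this (later steps assume these imports):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D,
                                     Bidirectional, LSTM, TimeDistributed, Dense)
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard
from sklearn.model_selection import train_test_split
```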
Step 2: Load and Explore the NER Dataset:
The dataset is from Kaggle and is widely used for training NER projects. It is extracted from the Groningen Meaning Bank (GMB) and comprises thousands of sentences whose words are tagged specifically to train classifiers to predict named entities. The dataset contains English sentences with an annotation for each word. Here is the essential information about each entity:
In total, the corpus has 35,178 unique words and 17 unique tags. From the plot below, we can see the word count within each tag group.
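A rough sketch of the loading and exploration step, assuming the ner_dataset.csv file from the Kaggle corpus (it has "Sentence #", "Word", "POS", and "Tag" columns):

```python
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")   # propagate the sentence number down to every word row

words = list(set(data["Word"].values))
tags = list(set(data["Tag"].values))
print("Unique words:", len(words))   # 35,178 in this corpus
print("Unique tags:", len(tags))     # 17

# Word count within each tag group (the bar plot referenced above)
data.groupby("Tag").size().plot.bar(figsize=(10, 5))
plt.show()
```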
Step 3: Data Manipulation (Integrate Tokens from the Same Sentence):
Now we want to group the tokens of each sentence together and associate each word with its respective tag.
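A sketch of this grouping step, assuming the column names above:

```python
# Collect each sentence's (word, tag) pairs into one list, keyed by the "Sentence #" column
group_fn = lambda s: [(w, t) for w, t in zip(s["Word"].values, s["Tag"].values)]
sentences = data.groupby("Sentence #").apply(group_fn).tolist()

print(sentences[0][:5])   # first five (word, tag) pairs of the first sentence
```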
Step 4: Build a Vocabulary for Words and Tags:
Now we are going to build 2 vocabularies based on the words and tags from step 3.
Looking at the word2idx dictionary, each word/token is assigned a unique index. Rather than feeding the raw words to the model, we use these indexes; a reverse dictionary lets us map an index back to its corresponding word later on.
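Building on the word and tag lists from step 2, a minimal sketch of the two vocabularies (reserving index 0 for padding is our own convention here):

```python
word2idx = {w: i + 1 for i, w in enumerate(words)}   # 0 is reserved for padding
tag2idx = {t: i for i, t in enumerate(tags)}

# Reverse dictionaries to map indexes back to words/tags later
idx2word = {i: w for w, i in word2idx.items()}
idx2tag = {i: t for t, i in tag2idx.items()}
```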
Step 5: Padding the Input Sentence to the Same Length:
Using Keras, we need to pad each sentence to the same length before feeding it to the model. For example, if an input sentence contains only 20 words, we pad the remaining positions with zeros. From the sentence length distribution, we can see that the mean is around 20 words per sentence.
We choose a maximum length of 50, so very few sentences need to be truncated. Our X vector will be the numerical representation of all our words, and our y vector (target) will be the tags associated with each word.
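A sketch of the padding step, assuming the sentences, word2idx, and tag2idx built above; padding the targets with the index of the "O" tag is one common choice:

```python
max_len = 50

X = [[word2idx[w] for w, t in s] for s in sentences]
X = pad_sequences(X, maxlen=max_len, padding="post", value=0)

y = [[tag2idx[t] for w, t in s] for s in sentences]
y = pad_sequences(y, maxlen=max_len, padding="post", value=tag2idx["O"])
```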
Step 6: Build and Compile a Bidirectional LSTM Model
First, we split the data into train and test sets. Then, we build our model using TensorFlow Keras. We start with an input layer of shape 50, as defined in step 5. Then we add an embedding layer and apply spatial dropout, which can drop entire 1D feature maps across all the channels. Finally, we add our bidirectional LSTM.
The next step is to compile our model with the Adam optimizer, sparse categorical cross-entropy loss, and an accuracy metric; a sketch of the model definition and compilation follows this list.
- Adam: a stochastic gradient descent variant used to train deep learning models.
- Sparse categorical cross-entropy: a cross-entropy loss for multi-class classification where the targets are integer class indexes rather than one-hot vectors.
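Putting step 6 together, here is a hedged sketch of the data split, model definition, and compilation. The embedding size, LSTM units, dropout rates, and the TimeDistributed softmax output layer are our assumptions, so the parameter count may differ slightly from the summary below:

```python
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

num_words = len(word2idx) + 1          # +1 for the padding index
num_tags = len(tag2idx)

input_word = Input(shape=(max_len,))
emb = Embedding(input_dim=num_words, output_dim=50, input_length=max_len)(input_word)
emb = SpatialDropout1D(0.1)(emb)       # drops entire 1D feature maps across all channels
bilstm = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(emb)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(bilstm)

model = Model(input_word, out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```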
Here is the model summary; the total number of parameters is 1,879,750.
Step 7: Train the Model
Now we train the model and apply TensorBoard to check the detailed structure and performance.
- Early Stopping: if the val_accuracy does not improve after 5 epochs, then stop training.
We call model.fit to train the model and use the test data for validation.
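Building on the model from step 6, a sketch of the training call with the early-stopping and TensorBoard callbacks described above (batch size and epoch count are assumptions):

```python
callbacks = [
    EarlyStopping(monitor="val_accuracy", patience=5, mode="max",
                  restore_best_weights=True, verbose=1),
    TensorBoard(log_dir="logs"),
]

history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    batch_size=32,
    epochs=25,
    callbacks=callbacks,
)
```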
Here we have the model accuracy and loss on the train and test datasets. We can see that the accuracy is over 97%, which is quite high.
We can also use the TensorBoard callback to inspect the model's performance; here we have the accuracy and loss curves.
Step 8: Evaluate the Performance of the Named Entity Recognition Model
Finally, we evaluate the performance of our model with our test data.
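The evaluation itself is a single call; a minimal sketch:

```python
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print("Test loss: {:.4f}".format(loss))
print("Test accuracy: {:.4f}".format(accuracy))
```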
- Test loss: 0.0956
- Test Accuracy: 0.9789
Step 9: User Interface
We have also created a user interface so people can play around with sentences. Here is a demonstration where users can pick any sentence and see the predicted tag for each word, mirroring the model performance evaluation above.
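Under the hood, such a demo boils down to a prediction like the following sketch (our own illustration, not the actual interface code):

```python
i = np.random.randint(0, len(x_test))                       # pick a random test sentence
pred = np.argmax(model.predict(x_test[i:i + 1]), axis=-1)[0]

for word_idx, tag_idx in zip(x_test[i], pred):
    if word_idx != 0:                                        # skip padding positions
        print("{:15} {}".format(idx2word[word_idx], idx2tag[tag_idx]))
```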
Example 2: Named Entity Recognition Using spaCy
Pre-trained spaCy Model
A simpler approach to the NER problem is to use spaCy, an open-source library for NLP. It provides features such as tokenization, part-of-speech (POS) tagging, text classification, and named entity recognition. The detailed code for the spaCy pre-trained model is available in our GitHub repository. The example here detects important keywords in resumes; the idea is to use NER to identify relevant words within each tag category.
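A minimal sketch with a pre-trained spaCy pipeline (the pipeline name and the resume snippet are illustrative; the small English model can be installed with "python -m spacy download en_core_web_sm"):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # pre-trained small English pipeline

# A made-up resume snippet for illustration
text = ("John Smith worked as a data scientist at Acme Corp in New York "
        "from 2018 to 2021 and holds an MSc from Stanford University.")
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```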
There are also many other pre-trained models, such as the Stanford NER model. These models are usually trained on large datasets, so they can be used in many more contexts and provide better performance.
Train Our Own spaCy Model:
We can also train our own model and teach it to pick out the keywords and information that we want. The code is also available in our GitHub repository; we used a contracts dataset and trained the NER model to extract the important information from each contract.
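Here is a hedged sketch of training a blank spaCy NER model in the spaCy 3.x style; the training sentence, entity offsets, labels, and hyperparameters are made up for illustration and are not taken from the contracts dataset:

```python
import random

import spacy
from spacy.training import Example

# Hypothetical annotated examples: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("This Agreement is made between Acme Corp and John Smith.",
     {"entities": [(31, 40, "ORG"), (45, 55, "PERSON")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)
```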
Conclusion
Named Entity Recognition locates words and phrases in unstructured text and classifies them into distinct categories. In this article, we showed examples using a bidirectional LSTM (BiLSTM) with Keras and using spaCy to solve NER problems. The BiLSTM produced quite good results because it understands context better by processing the input sequence in both directions. spaCy also proved an exceptionally efficient tool for natural language processing and for "understanding" large volumes of text. We hope you now have a better understanding of training NER models. Thanks for reading!
GitHub Repository
References
https://www.aclweb.org/anthology/Q16-1026.pdf
https://www.youtube.com/watch?v=8HyCNIVRbSU&t=1s
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus