Notes on Deep Learning with R Chapter 6
Chapter 6 - Deep Learning for Text and Sequences
These are my notes on Chapter 6 of Deep Learning with R by Chollet and Allaire.
- Working with text
- Understanding recurrent neural networks
- Advanced use of RNNs
- Sequence processing with convnets
Working with text
You can work with text either as a sequence of characters or as a sequence of words. In either case, deep learning for natural language is still just pattern recognition; the computer does not actually understand the text.
Neural networks cannot take plain text as input; the text must first be vectorized into a numeric tensor. This can be accomplished in a few ways.
- Segment the text into words and then transform each word into a vector.
- Segment the text into characters and then transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.
Whichever unit you choose to break the text into is called a token. We must then represent these tokens as vectors. This can be done with one-hot encoding or token embedding.
One-hot encoding
A unique integer index is associated with every word. Each text can then be encoded as a binary vector the size of the vocabulary, with a 1 at the position of every word that appears in the text and 0 everywhere else.
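A minimal sketch of word-level one-hot encoding with the keras tokenizer utilities (the sample texts are made up for illustration):

```r
library(keras)

# Hypothetical sample texts just for illustration.
samples <- c("The cat sat on the mat.", "The dog ate my homework.")

# Build a tokenizer that keeps only the 1,000 most common words,
# and fit it on the texts to build the word index.
tokenizer <- text_tokenizer(num_words = 1000) %>%
  fit_text_tokenizer(samples)

# Binary mode gives a one-hot style matrix: one row per text,
# with a 1 in each column whose word appears in that text.
one_hot_results <- texts_to_matrix(tokenizer, samples, mode = "binary")

# The learned word -> integer mapping.
word_index <- tokenizer$word_index
```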
Word embeddings
Instead of the high-dimensional sparse vectors given by one-hot encoding, word embeddings give low-dimensional floating-point vectors. Word embeddings are learned from data. This can be done in two ways.
- Learn word embeddings jointly with your main task: start with random word vectors and learn them the same way you learn the weights of the network.
- Load into your model word embeddings that were precomputed (pretrained word embeddings); see the sketch below.
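A rough sketch of the second option. It assumes an `embedding_matrix` (vocabulary size x embedding dimension) has already been built elsewhere from precomputed vectors such as GloVe; the dimensions are placeholders:

```r
library(keras)

max_words <- 10000     # vocabulary size (placeholder)
embedding_dim <- 100   # dimensionality of the precomputed vectors (placeholder)
maxlen <- 100          # number of words kept per text (placeholder)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = embedding_dim,
                  input_length = maxlen) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# `embedding_matrix` is assumed to exist: a (max_words x embedding_dim)
# matrix built from the precomputed vectors. Load it into the embedding
# layer and freeze it so training does not modify it.
get_layer(model, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()
```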
Using an embedding layer
The simplest way to associate a dense vector with a word is to start with a random vector. The problem with this is that the resulting embedding space has no structure (e.g. synonyms may get very different embeddings).
The geometric relationships between word vectors should reflect the semantic relationships between these words. You would expect synonyms to be embedded into similar word vectors. Generally, the geometric distance between two word vectors relates to the semantic difference between the associated words.
Directions in the embedding space also hold meaning. For example, a “gender” vector could take the word “king” to the word “queen,” and a “plural” vector could take “queen” to “queens.”
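A minimal sketch of an embedding layer learned jointly with a classification task (the sizes are illustrative: a 10,000-word vocabulary, sequences cut to 20 words, 8-dimensional embeddings):

```r
library(keras)

model <- keras_model_sequential() %>%
  # Turns integer word indices into dense 8-dimensional vectors.
  layer_embedding(input_dim = 10000, output_dim = 8, input_length = 20) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)
# Training on integer-encoded texts then learns the embedding weights
# jointly with the classifier.
```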
Understanding recurrent neural networks
Dense networks and convnets have no memory. To analyze something like text or a time series, you would have to show the whole sequence to the network at once, as a single data point. Networks like this are called feedforward networks.
In contrast, a recurrent neural network (RNN) processes a sequence by iterating through its elements while maintaining a state containing information about what it has seen so far (via a loop). This state is reset between independent sequences (e.g. between two different movie reviews).
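A minimal sketch of a simple recurrent layer on top of an embedding layer (layer sizes are illustrative):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%
  # The RNN iterates over the 32-dimensional word vectors, carrying a
  # 32-unit state from one timestep to the next.
  layer_simple_rnn(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")
```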
LSTM
LSTM stands for Long Short-Term Memory. An LSTM layer carries information across many timesteps, keeping older signals from gradually vanishing during processing (which helps address the vanishing-gradient problem).
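Swapping the simple RNN for an LSTM is a one-layer change (a sketch with illustrative sizes):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%
  # The LSTM carries an additional cell state that lets information
  # survive across many timesteps.
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)
```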
Advanced use of RNNs
Recurrent dropout
The same dropout mask should be applied at every timestep, rather than a dropout mask that varies randomly from timestep to timestep.
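In keras this corresponds to the `dropout` and `recurrent_dropout` arguments of the recurrent layers. A sketch (the rates, layer sizes, and the use of an embedding input here are illustrative):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%
  layer_gru(units = 32,
            dropout = 0.2,                # dropout on the layer's inputs
            recurrent_dropout = 0.2) %>%  # same mask applied to the recurrent state at every timestep
  layer_dense(units = 1, activation = "sigmoid")
```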
Stacking recurrent layers
Stacking recurrent layers increases the capacity of the network, which helps until the model starts overfitting, but deeper recurrent stacks take much longer to train.
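When stacking, every recurrent layer except the last must return its full sequence of outputs rather than only its final state. A sketch with illustrative sizes:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%
  # return_sequences = TRUE makes the layer output its state at every
  # timestep, which the next recurrent layer needs as input.
  layer_gru(units = 32, return_sequences = TRUE) %>%
  layer_gru(units = 64) %>%
  layer_dense(units = 1, activation = "sigmoid")
```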
Bidirectional RNN
Note that RNNs are order dependent (they must receive timesteps in order). A bidirectional RNN exploits this order sensitivity by processing the sequence both forward and backward and then combining the two representations.
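In keras, a recurrent layer can be wrapped with `bidirectional()`, which trains one copy of the layer on the sequence in order and another on it reversed, then merges their outputs. A sketch with illustrative sizes:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%
  # Two LSTMs: one reads the sequence forward, the other backward;
  # their representations are combined.
  bidirectional(layer_lstm(units = 32)) %>%
  layer_dense(units = 1, activation = "sigmoid")
```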
Sequence processing with convnets
We used convnets to extract local features in images. We may also want to do that with sequence data, using a 1D convolution.
1D convnets are usually much cheaper to train than RNNs and perform competitively on many sequence tasks. You can also combine a convnet with an RNN, using the convnet to downsample long sequences before the RNN processes them.
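A minimal 1D convnet sketch for sequence classification (all sizes are illustrative):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 128, input_length = 500) %>%
  # Each 1D convolution extracts local patterns from windows of 7 timesteps.
  layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 5) %>%
  layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>%
  # Collapse the remaining timesteps by taking the maximum of each feature map.
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 1, activation = "sigmoid")
```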