Sequence Learning and NLP with Neural Networks
Sequence Regression | Sequence Classification | Sequence-to-Sequence Learning | Question Answering | Language Modeling
Sequence learning refers to a variety of related tasks that neural nets can be trained to perform. What all these tasks have in common is that the input to the net is a sequence of some kind. This input is usually variable length, meaning that the net can operate equally well on short or long sequences. Typical forms of input include:
- A variable-length array, encoded from a string using a "Characters" or "Tokens" NetEncoder.
- A variable-length array, encoded from an Audio object using an "Audio", "AudioSpectrogram", "AudioSTFT", etc. NetEncoder.
- Fixed-length forms of the above, e.g. by using the "TargetLength" option to the NetEncoder.
What distinguishes the various sequence learning tasks is the form of the output of the net. Here there is a wide diversity of techniques, each with a corresponding form of output:
- For autoregressive language models, used to model the probability of a particular sequence, the output is the next element of the sequence. In the case of a textual model, this is a character or token, as decoded via a "Class", "Characters" or "Tokens" NetDecoder.
- For sequence tagging models, the output is a sequence of classes of the same length as the input. For example, in the case of a part-of-speech tagger, these classes are "Noun", "Verb", etc. For this, a "Class" NetDecoder is appropriate.
- For translation models, e.g. an English to French translator, the output is itself a language model, albeit one that is conditional on the source sequence. In other words, there are two inputs to the net: the complete source sequence, and the target sequence so far, and the output is a prediction of the next element of the target sequence.
- For CTC (connectionist temporal classification) models, the input sequence is used to form a sequence of intermediate predictions for the target sequence, which is always shorter than the input sequence. Examples of this include handwriting recognition from pixel or stroke data, in which the input is implicitly segmented into individual characters, and audio transcription, in which features of the audio are implicitly segmented into characters or phonemes. For these, a "CTCBeamSearch" NetDecoder must be used.
Integer Addition
In this example, a net is trained to add two two-digit integers together. What makes the problem hard is that the inputs are strings rather than numeric values, whereas the output is a numeric value. For example, the net takes as input "25+33" and needs to return a real number close to 58. Note that the input is variable length, as the training data contains examples of length 3 ("5+5"), length 4 ("5+10") and length 5 ("10+10").
Create training data based on strings that describe two-digit additions and the corresponding numeric result:
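Below is a minimal sketch of how such data might be generated (the dataset sizes and the helper name genPair are illustrative assumptions):

genPair[] := With[{a = RandomInteger[{0, 99}], b = RandomInteger[{0, 99}]},
   ToString[a] <> "+" <> ToString[b] -> N[a + b]];
trainingData = Table[genPair[], {10000}];
testData = Table[genPair[], {1000}];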
Create a net composed of a chain of recurrent layers to read an input string and predict the numeric result:
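One possible architecture is sketched below, assuming the characters are embedded, summarized by a GatedRecurrentLayer, and the final state mapped to a single real number; the layer sizes are arbitrary choices:

net = NetChain[
   {EmbeddingLayer[16], GatedRecurrentLayer[64], SequenceLastLayer[], LinearLayer[]},
   "Input" -> NetEncoder[{"Characters", Characters["0123456789+"]}],
   "Output" -> "Real"];
trained = NetTrain[net, trainingData, ValidationSet -> testData];
trained["25+33"]

With real-valued targets, NetTrain automatically attaches a MeanSquaredLossLayer.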
Sentiment Analysis
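In this example, a net is trained to classify the sentiment of a snippet of text as positive or negative. Below is a minimal sketch of such a sequence classification net; the use of the "MovieReview" ExampleData set, the class labels, and the layer sizes are assumptions:

reviews = ExampleData[{"MachineLearning", "MovieReview"}, "Data"];
{train, test} = TakeDrop[RandomSample[reviews], 9000];
sentimentNet = NetChain[
   {EmbeddingLayer[64], GatedRecurrentLayer[64], SequenceLastLayer[],
    LinearLayer[2], SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Tokens"}],
   "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]];
trainedSentiment = NetTrain[sentimentNet, train, ValidationSet -> test];
trainedSentiment["a gripping and beautifully acted film"]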
Sequence-to-sequence learning is a task in which both the input and the predicted output are sequences. Translating German to English, transcribing speech from an audio file, and sorting lists of numbers are all examples of this task.
Integer Addition with Fixed-Length Output
This example demonstrates how to train nets that take a variable-length sequence as input and predict a fixed-length sequence as output. We take a string that describes a sum, e.g. "250+123", and produce an output string that describes the answer, e.g. "373".
Create training data based on strings that describe three-digit additions and the corresponding result as a string. In order for the output to be fixed length, all outputs are padded to the maximum length of 4 (as the maximum value is 999+999=1998):
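Both the data generation and one possible net are sketched below; the left padding with spaces and the architecture (the input summary is replicated four times and decoded into four class predictions) are assumptions:

inChars = Characters["0123456789+"];
outChars = Characters["0123456789 "];
genPair[] := With[{a = RandomInteger[{0, 999}], b = RandomInteger[{0, 999}]},
   ToString[a] <> "+" <> ToString[b] -> Characters[StringPadLeft[ToString[a + b], 4]]];
data = Table[genPair[], {20000}];
fixedNet = NetChain[
   {EmbeddingLayer[32], GatedRecurrentLayer[128], SequenceLastLayer[],
    ReplicateLayer[4],                    (* one copy of the summary per output position *)
    GatedRecurrentLayer[128],
    NetMapOperator[LinearLayer[Length[outChars]]],
    SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Characters", inChars}],
   "Output" -> NetDecoder[{"Class", outChars}]];
trainedFixed = NetTrain[fixedNet, data];
trainedFixed["250+123"]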
Integer Addition with Variable-Length Output
This example demonstrates how to train nets on sequences where both the input and the output are variable-length sequences, and the two generally have different lengths. A sophisticated example of this task is translating English to German, but the example we cover is a simpler problem: taking a string that describes a sum, e.g. "250+123", and producing an output string that describes the answer, e.g. "373". The method used is based on I. Sutskever et al., "Sequence to Sequence Learning with Neural Networks", 2014.
Create training data based on strings that describe three-digit additions and the corresponding result as a string:
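A sketch of this data generation step (the dataset size is arbitrary):

genPair[] := With[{a = RandomInteger[{0, 999}], b = RandomInteger[{0, 999}]},
   ToString[a] <> "+" <> ToString[b] -> ToString[a + b]];
trainingData = Table[genPair[], {50000}];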
Create a NetEncoder that uses a special code for the start and end of a string, which will be used to indicate the beginning and end of the output sequence:
Define a net that takes a vector of size 150 and a sequence of vectors as input and returns a sequence of vectors as output:
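A sketch of such a decoder net is shown below: the size-150 vector enters as the initial state of a GatedRecurrentLayer that processes the target sequence so far. The number of output classes (12, for the ten digits plus the start and end codes) is an assumption:

decoderNet = NetGraph[
   {GatedRecurrentLayer[150], NetMapOperator[LinearLayer[12]], SoftmaxLayer[]},
   {NetPort["State"] -> NetPort[{1, "State"}], 1 -> 2 -> 3},
   "State" -> 150]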
Define a net with a CrossEntropyLossLayer and containing the encoder and decoder nets:
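A sketch of the full training graph, assuming an encoder chain that summarizes the source string as a vector of size 150, the decoderNet sketched above, and hypothetical sourceEncoder and targetEncoder ("Characters" NetEncoders producing one-hot vectors, with the start and end codes on the target side):

encoderNet = NetChain[{GatedRecurrentLayer[150], SequenceLastLayer[]}];
trainingNet = NetGraph[
   <|"encoder" -> encoderNet,
     "decoder" -> decoderNet,
     "most" -> SequenceMostLayer[],   (* target without its final character *)
     "rest" -> SequenceRestLayer[],   (* target without its initial character *)
     "loss" -> CrossEntropyLossLayer["Probabilities"]|>,
   {NetPort["Input"] -> "encoder" -> NetPort[{"decoder", "State"}],
    NetPort["Target"] -> "most" -> NetPort[{"decoder", "Input"}],
    "decoder" -> NetPort[{"loss", "Input"}],
    NetPort["Target"] -> "rest" -> NetPort[{"loss", "Target"}]},
   "Input" -> sourceEncoder, "Target" -> targetEncoder];   (* sourceEncoder and targetEncoder are placeholders *)
trainedNet = NetTrain[trainingNet,
   <|"Input" -> Keys[trainingData], "Target" -> Values[trainingData]|>]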
In this case of three-digit integer addition, there are only 1999 possible outputs. It is feasible to calculate the loss for each possible output and find the one that minimizes the loss:
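A sketch of this exhaustive scoring, assuming trainedNet is the trained loss graph from above (evaluating it on an input/target pair returns the loss for that candidate answer):

predictByLoss[net_, input_String] := ToString@First@MinimalBy[
    Range[0, 1998],
    net[<|"Input" -> input, "Target" -> ToString[#]|>] &];
predictByLoss[trainedNet, "250+123"]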
A more efficient way of obtaining predictions is to generate the output until the EndOfString virtual character is reached.
First, extract the trained "encoder" and "decoder" subnets from the trained NetGraph, and attach appropriate encoders and decoders:
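A sketch of this step, using the node names from the training graph sketched above; NetReplacePart can then be used to attach the appropriate NetEncoder or NetDecoder to the extracted subnets:

encTrained = NetExtract[trainedNet, "encoder"];
decTrained = NetExtract[trainedNet, "decoder"];
encTrained = NetReplacePart[encTrained, "Input" -> sourceEncoder];   (* sourceEncoder as before *)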
Use a SequenceLastLayer to make a version of the decoder that only produces predictions for the last character of the answer, given the previous characters. Here a character decoder is attached for the probability vector using the same alphabet as the "Target" input:
Now define a prediction function that takes the "encoder" and "decoder" nets and an input string. The function will compute successively longer results until the decoder claims to be finished:
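A sketch of such a generation loop is shown below. Here nextChar is a hypothetical wrapper that, given the encoder's code vector and the characters produced so far (beginning with StartOfString), returns the most likely next character:

predictSum[enc_, nextChar_, input_String] := Module[
   {code = enc[input], chars = {StartOfString}, c},
   While[Length[chars] < 8,           (* the answer has at most 4 digits *)
     c = nextChar[code, chars];
     If[c === EndOfString, Break[]];
     AppendTo[chars, c]];
   StringJoin[Rest[chars]]];
predictSum[encTrained, nextChar, "250+123"]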
The naive technique of generating by passing in each partial answer to the decoder net to derive the next character has time complexity O(n^2), where n is the length of the output sequence, and so is not appropriate for generating longer sequences. NetStateObject can be used to generate with time complexity O(n).
First, a decoder is created that takes a single character and predicts the next character. The recurrent state of the GatedRecurrentLayer is handled by a NetStateObject at a later point.
Define a "Class" encoder and decoder that will encode and decode individual characters, as well as the special classes to indicate the start and end of the string:
Define a net that takes a single character, runs one step of the GatedRecurrentLayer, and produces a single softmax prediction:
This predictor has an internal recurrent state, as revealed by Information:
Create a function that uses NetStateObject to memorize the internal recurrent state, which is seeded from the code produced by the trained encoder:
Integer Sorting
In this example, a net is trained to sort lists of integers. For example, the net takes as input {3,2,4} and needs to return {2,3,4}. This example also demonstrates the use of an AttentionLayer, which significantly improves the performance of neural nets on many sequence learning tasks.
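One possible attention-based architecture is sketched below, assuming the integers are drawn from 1 through 10: every position of the recurrent encoding is used as a query that attends over the whole input sequence, and each attended result is classified into one of the ten integer classes. This is a simplified sketch rather than the exact net used originally:

sortData = Table[
   With[{ints = RandomInteger[{1, 10}, RandomInteger[{3, 8}]]}, ints -> Sort[ints]],
   {20000}];
sortNet = NetGraph[
   <|"embed" -> EmbeddingLayer[32, 10],
     "encode" -> GatedRecurrentLayer[64],
     "attend" -> AttentionLayer["Dot"],
     "classify" -> NetMapOperator[LinearLayer[10]],
     "softmax" -> SoftmaxLayer[]|>,
   {"embed" -> "encode",
    "encode" -> NetPort[{"attend", "Input"}],
    "encode" -> NetPort[{"attend", "Query"}],
    "attend" -> "classify" -> "softmax"},
   "Output" -> NetDecoder[{"Class", Range[10]}]];
trainedSort = NetTrain[sortNet, sortData];
trainedSort[{3, 2, 4}]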
Optical Character Recognition (OCR) on a Toy Dataset
The optical character recognition problem is to take an image containing a sequence of characters and return the list of characters. One simple approach is to preprocess the image to produce sub-images containing only a single character and then classify each one. This approach is fragile and fails completely for domains such as cursive handwriting, where the characters run together.
First, generate training and test data, which consists of images of words and the corresponding word string:
Take a RandomSample of the training set:
Define a net that takes an image and then treats the width dimension as a sequence dimension. A sequence of probability vectors over the width dimension is produced:
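A sketch of such a net, assuming grayscale word images of size 64×12 and a lowercase alphabet: the TransposeLayer and ReshapeLayer rearrange the image so that the 64 width positions become a sequence of feature vectors, and the "CTCBeamSearch" decoder collapses the per-position predictions into a character string:

letters = Characters["abcdefghijklmnopqrstuvwxyz"];
ocrNet = NetChain[
   {TransposeLayer[1 <-> 3],      (* channels x height x width -> width x height x channels *)
    ReshapeLayer[{64, 12}],       (* one 12-element feature vector per width position *)
    GatedRecurrentLayer[64],
    NetMapOperator[LinearLayer[Length[letters] + 1]],   (* +1 for the CTC blank class *)
    SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Image", {64, 12}, "ColorSpace" -> "Grayscale"}],
   "Output" -> NetDecoder[{"CTCBeamSearch", letters}]]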
Simple RNN Trained on the bAbI QA Dataset
Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset using a recurrent network.
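One possible recurrent architecture is sketched below: the context and the question are each read by a recurrent layer, the two summaries are catenated, and a classifier predicts the answer word. The small placeholder vocabulary and answer list stand in for the ones derived from the actual dataset:

vocab = {"mary", "john", "went", "to", "the", "kitchen", "garden", "where", "is"};
answers = {"kitchen", "garden"};
qaNet = NetGraph[
   <|"context" -> NetChain[{EmbeddingLayer[32], GatedRecurrentLayer[64], SequenceLastLayer[]}],
     "question" -> NetChain[{EmbeddingLayer[32], GatedRecurrentLayer[64], SequenceLastLayer[]}],
     "cat" -> CatenateLayer[],
     "classify" -> NetChain[{LinearLayer[Length[answers]], SoftmaxLayer[]}]|>,
   {NetPort["Context"] -> "context",
    NetPort["Question"] -> "question",
    {"context", "question"} -> "cat" -> "classify"},
   "Context" -> NetEncoder[{"Tokens", vocab}],
   "Question" -> NetEncoder[{"Tokens", vocab}],
   "Output" -> NetDecoder[{"Class", answers}]]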
Train the network for three training rounds. NetTrain will automatically attach a CrossEntropyLossLayer using the same classes that were provided to the decoder:
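A sketch of the call, assuming qaNet from above and hypothetical lists contexts, questions, and answerWords holding the training examples:

trainedQA = NetTrain[qaNet,
   <|"Context" -> contexts, "Question" -> questions, "Output" -> answerWords|>,
   MaxTrainingRounds -> 3]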
Memory Network Trained on the bAbI QA Dataset
Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset, using a memory network based on Sukhbaatar et al., "End-to-End Memory Networks", 2015.
The memory net has layers (such as TransposeLayer) that currently do not support variable-length sequences.
Convert all strings to lists of tokens and use left padding (which has better performance than right padding for this example) to ensure these lists are the same length:
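A sketch of this preprocessing, assuming simple whitespace tokenization and an empty-string padding token; maxLen would be the length of the longest token list in the dataset:

tokenize[s_String] := StringSplit[StringDelete[ToLowerCase[s], "." | "?"]];
padTokens[s_String, maxLen_Integer] := PadLeft[tokenize[s], maxLen, ""];
padTokens["Mary went to the kitchen.", 10]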
Train the net using the "RMSProp" optimization method, which improves learning performance for this example:
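A sketch of the call, assuming memoryNet and paddedData are the net and preprocessed data described above:

trainedMemory = NetTrain[memoryNet, paddedData, Method -> "RMSProp"]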
Character-Level Language Model
The data takes the form of a classification problem: given a sequence of characters, predict the next one.
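A sketch of how such examples can be generated from a plain text corpus (here a placeholder text from ExampleData): each example pairs a 25-character window with the character that follows it:

text = ToLowerCase[ExampleData[{"Text", "AliceInWonderland"}]];
makeExample[] := With[{i = RandomInteger[{1, StringLength[text] - 25}]},
   StringTake[text, {i, i + 24}] -> StringTake[text, {i + 25, i + 25}]];
trainingExamples = Table[makeExample[], {50000}];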
Train the net. This can take up to an hour on a CPU, so use TargetDevice->"GPU" if you have an NVIDIA graphics card available. A modern GPU should be able to complete this example in about 7 minutes:
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text:
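A sketch of such a generation function is shown below. It assumes charStepNet is a hypothetical single-step version of the trained net that consumes one character at a time and returns the predicted next character, with its recurrent state carried between calls by the NetStateObject:

generateText[stepNet_, start_String, n_Integer] := Module[
   {state = NetStateObject[stepNet], chars = Characters[start], next},
   Scan[state, Most[chars]];       (* build up the recurrent state on the start text *)
   next = Last[chars];
   StringJoin[start, Table[next = state[next], {n}]]];
generateText[charStepNet, "the quick brown ", 100]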
An alternative and equivalent formulation of this learning problem, requiring only a single string as input, is to separate the last character from the rest of the sequence inside the net.
Use SequenceMostLayer and SequenceLastLayer in a graph to separate the last character from the rest:
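A sketch of such a graph over an assumed alphabet is shown below; the prediction subnet mirrors the language model described earlier, and the loss compares its output with the code of the final character:

alphabet = Characters["abcdefghijklmnopqrstuvwxyz ,.;"];
splitNet = NetGraph[
   <|"most" -> SequenceMostLayer[],
     "last" -> SequenceLastLayer[],
     "predict" -> NetChain[{EmbeddingLayer[64, Length[alphabet]], GatedRecurrentLayer[128],
        SequenceLastLayer[], LinearLayer[Length[alphabet]], SoftmaxLayer[]}],
     "loss" -> CrossEntropyLossLayer["Index"]|>,
   {NetPort["Input"] -> "most" -> "predict" -> NetPort[{"loss", "Input"}],
    NetPort["Input"] -> "last" -> NetPort[{"loss", "Target"}]},
   "Input" -> NetEncoder[{"Characters", alphabet}]]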
Train this net on the input sequences from the original training data (technically, this means you end up training the net on slightly shorter sequences):
A more efficient way to train language models is to use "teacher forcing", in which the net simultaneously predicts the entire sequence, rather than just the last letter.
First, build the net that does prediction of an entire sequence at once. This differs from the previous prediction nets in that the LinearLayer is mapped and a matrix softmax is performed, instead of taking the last element and doing an ordinary vector softmax:
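A sketch of such a net, reusing the alphabet assumed above: the LinearLayer is mapped over every position of the recurrent output, and the softmax is applied to each position's vector. Its input is the sequence of character codes produced by the "Characters" encoder used in the graphs above:

predictAllNet = NetChain[
   {EmbeddingLayer[64, Length[alphabet]], GatedRecurrentLayer[128],
    NetMapOperator[LinearLayer[Length[alphabet]]], SoftmaxLayer[]}]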
Now build the forcing network, which takes a target sentence and presents it to the network in a "staggered" fashion: for a length-26 sentence, present characters 1 through 25 to the net so that it produces predictions for characters 2 through 26, which are compared with the real characters via the CrossEntropyLossLayer to produce a loss:
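A sketch of the forcing graph built around predictAllNet: the input with its last character dropped is fed to the prediction net, the input with its first character dropped serves as the target, and the CrossEntropyLossLayer compares the two position by position:

forcingNet = NetGraph[
   <|"most" -> SequenceMostLayer[],    (* characters 1 .. n-1 *)
     "rest" -> SequenceRestLayer[],    (* characters 2 .. n   *)
     "predict" -> predictAllNet,
     "loss" -> CrossEntropyLossLayer["Index"]|>,
   {NetPort["Input"] -> "most" -> "predict" -> NetPort[{"loss", "Input"}],
    NetPort["Input"] -> "rest" -> NetPort[{"loss", "Target"}]},
   "Input" -> NetEncoder[{"Characters", alphabet}]]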
Train the net on the input sequences from the original data. On a typical CPU, this should take around 15 minutes, compared to around 2 minutes on a modern GPU. As teacher forcing is a more efficient technique, you can afford to use a smaller number of training rounds:
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text: