Wolfram Language & System Documentation Center

AttentionLayer

AttentionLayer[]

represents a trainable net layer that learns to pay attention to certain portions of its input.

AttentionLayer[net]

specifies a particular net to give scores for portions of the input.

AttentionLayer[net,opts]

includes options for weight normalization, masking and other parameters.

Details and Options

AttentionLayer[net] takes a set of key vectors, a set of value vectors and one or more query vectors and computes for each query vector the weighted sum of the value vectors using softmax-normalized scores from net[<|"Input"->key,"Query"->query|>].
In its general single-head form, AttentionLayer[net] takes a key array K of dimensions d₁×…×d_n×k, a value array V of dimensions d₁×…×d_n×v and a query array Q of dimensions q₁×…×q_m×q. The key and value arrays can be seen as arrays of size d₁×…×d_n whose elements are vectors: the key vectors k_i have size k and the value vectors v_i have size v. Similarly, the query array can be seen as an array of size q₁×…×q_m whose elements are query vectors of size q. Note that the query array can be a single query vector of size q if m is 0. The scoring net f is then used to compute a scalar score s=f(k,q) for each combination of the d₁×…×d_n key vectors k and q₁×…×q_m query vectors q. These scalar scores are used to produce an output array O of size q₁×…×q_m whose elements are the weighted sums o=∑_i w_i v_i, where the weights are w=softmax(S), and S is the array of d₁×…×d_n scalar scores produced for a given query vector.
A common application of AttentionLayer[net] is when the keys are a matrix of size n×k, the values are a matrix of size n×v and the query is a single vector of size q. Then AttentionLayer will compute a single output vector o that is the weighted sum of the n value row vectors: o=∑_i w_i v_i with w_i=exp(z_i)/∑_j exp(z_j), where z_i=f(k_i,q). In the Wolfram Language, this can be written Total[values*weights], where weights is SoftmaxLayer[][Map[net[<|"Input"->#,"Query"->query|>]&,keys]].
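As a minimal sketch of the single-query case described above, the weighted sum can be evaluated by hand with plain list operations. This assumes the "Dot" scoring function f(k,q)=k.q; the names keys, values, query, scores and weights and all numbers are illustrative only.

```wolfram
(* Illustrative data: n = 3 keys of size 2, 3 values of size 3, one query of size 2 *)
keys = {{1, 0}, {0, 1}, {1, 1}};
values = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
query = {1, 2};

(* Scores z_i = f(k_i, q); the "Dot" scoring function is an assumption here *)
scores = keys . query;

(* Softmax-normalized weights w_i = Exp[z_i]/Sum_j Exp[z_j] *)
weights = N[Exp[scores]/Total[Exp[scores]]];

(* Weighted sum of the value rows: o = Sum_i w_i v_i *)
output = Total[values*weights]
```

The weights sum to 1, and the same output can be written as weights . values.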
In AttentionLayer[net], the scoring network net can be one of:

	"Dot"	a NetGraph computing s=Dot[k,q]
	"Bilinear"	a NetGraph computing s=Dot[k,W,q], where W is a learnable matrix (default)
	NetGraph[…]	a specific NetGraph that takes "Input" and "Query" vectors and produces a scalar "Output" value
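To make the two built-in scoring formulas concrete, here is a small hand computation with ordinary lists. The matrix W below is a made-up stand-in for the learnable matrix of the "Bilinear" net, and all names and numbers are illustrative.

```wolfram
k = {1, 2, 3};                  (* key vector of size 3 *)
q = {2, -1};                    (* query vector of size 2 *)
W = {{1, 0}, {0, 1}, {1, 1}};   (* hypothetical 3x2 weight matrix *)

(* "Bilinear"-style score: k.W = {4, 5}, so s = 4*2 + 5*(-1) = 3 *)
bilinearScore = Dot[k, W, q]

(* "Dot"-style score: key and query must have the same size *)
dotScore = Dot[{1, 2, 3}, {4, 5, 6}]   (* 4 + 10 + 18 = 32 *)
```

Because W has one row per key component and one column per query component, the bilinear form allows keys and queries of different sizes, whereas the plain dot product does not.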

The following optional parameters can be included:

"Dropout"	0	dropout rate for the attention weights
LearningRateMultipliers	Automatic	learning rate multipliers for the scoring network
"Mask"	None	prevent certain patterns of attention
"MultiHead"	False	whether to perform multi-head attention, where the penultimate dimension corresponds to different heads
"ScoreRescaling"	None	method to scale the scores

Possible settings for "Mask" are:

	None	no masking
	"Causal"	causal masking
	"Causal"->n	local causal masking with a window of size n

Specifying "Dropout"->p applies dropout with probability p on the attention weights, where p is a scalar with 0≤p<1.
With the setting "Mask"->"Causal", the query input is constrained to be a sequence of vectors with the same length as the key and the value inputs, and only positions t'<=t of the key and the value inputs are used to compute the output at position t.
With the setting "Mask"->"Causal"->n, where n is a positive integer, only positions t-n<t'<=t of the key and the value inputs are used to compute the output at position t.
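The two masking rules can be visualized as 0/1 patterns. The sketch below is only an illustration of the inequalities t'<=t and t-n<t'<=t, not the layer's internal implementation. The helper names causalMask and localCausalMask are hypothetical, and the choice of rows for the output position t and columns for the key and value position t' is a convention adopted for this example.

```wolfram
(* 1 where key/value position j may be used for output position i, 0 otherwise *)
causalMask[len_] := Table[Boole[j <= i], {i, len}, {j, len}];

(* Same, restricted to a window of the last n positions *)
localCausalMask[len_, n_] := Table[Boole[i - n < j <= i], {i, len}, {j, len}];

causalMask[4] // MatrixForm
localCausalMask[4, 2] // MatrixForm
```

With a window of size 2, each output position sees only itself and the position immediately before it.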
With the setting "MultiHead"->True, key and value inputs must be at least of rank three, the query input must be at least of rank two, and the penultimate dimension should be the same for all inputs, representing the number of attention heads. Each attention head corresponds to a distinct attention mechanism, and the outputs of all heads are joined.
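The per-head computation can be sketched with plain list operations, assuming a "Dot" scoring function and a single query vector per head. The helper dotAttend is hypothetical and the data is random; this illustrates the idea of independent heads rather than the layer's actual implementation.

```wolfram
(* Hypothetical helper: single-head attention with an assumed "Dot" scoring function *)
dotAttend[k_, v_, q_] := With[{s = k . q}, (Exp[s]/Total[Exp[s]]) . v];

(* Three positions, two heads; keys and queries of size 5, values of size 4 *)
key = RandomReal[1, {3, 2, 5}];
value = RandomReal[1, {3, 2, 4}];
query = RandomReal[1, {2, 5}];

(* Head h uses the slice at index h of the penultimate dimension of each input *)
perHead = Table[dotAttend[key[[All, h]], value[[All, h]], query[[h]]], {h, 2}];

Dimensions[perHead]  (* {2, 4}: one output vector of size 4 per head *)
```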
With the setting "ScoreRescaling"->"DimensionSqrt", the scores are divided by the square root of the key's input dimension before being normalized by the softmax: w=softmax(S/√k).
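The effect of this rescaling on the weights can be seen with a small hand computation. The scores and the key size below are made up, and softmax is a locally defined helper rather than a built-in function.

```wolfram
(* Local helper: softmax of a list of scores *)
softmax[s_] := Exp[s]/Total[Exp[s]];

scores = {4., 2., 0., -2.};   (* illustrative raw scores *)
keySize = 16;                 (* assumed size of each key vector *)

plain = softmax[scores]
rescaled = softmax[scores/Sqrt[keySize]]

(* Rescaling flattens the distribution but keeps the ordering of the weights *)
{Max[plain] > Max[rescaled], Ordering[plain] === Ordering[rescaled]}
```

Dividing by a larger square root pushes the weights closer to uniform, which is why the rescaled outputs are less contrasted.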
AttentionLayer is typically used inside NetGraph.
AttentionLayer exposes the following input ports for use in NetGraph etc.:

	"Key"	an array of size d₁×…×d_n×k (or d₁×…×d_n×h×k with multi-head attention)
	"Value"	an array of size d₁×…×d_n×v (or d₁×…×d_n×h×v with multi-head attention)
	"Query"	an array of size q₁×…×q_m×q (or q₁×…×q_m×h×q with multi-head attention)

AttentionLayer exposes an output port for use in NetGraph etc.:

	"Output"	an array of outputs with dimensions q₁×…×q_m×v (or q₁×…×q_m×h×v with multi-head attention)

AttentionLayer exposes an extra port to access internal attention weights:

	"AttentionWeights"	an array of weights with dimensions d₁×…×d_n×q₁×…×q_m (or d₁×…×d_n×h×q₁×…×q_m with multi-head attention)

AttentionLayer[…,"Key"->shape₁,"Value"->shape₂,"Query"->shape₃] allows the shapes of the inputs to be specified. Possible forms for shape_i include:

	NetEncoder[…]	encoder producing a sequence of arrays
	{d₁,d₂,…}	an array of dimensions d₁×d₂×…
	{"Varying",d₁,d₂,…}	an array whose first dimension is variable and remaining dimensions are d₁×d₂×…
	{Automatic,…}	an array whose dimensions are to be inferred
	{"Varying",Automatic,…}	a varying number of arrays each of inferred size

The sizes of the key, value and query arrays are usually inferred automatically within a NetGraph.
AttentionLayer[…][<|"Key"->key,"Value"->value,"Query"->query|>] explicitly computes the output from applying the layer.
AttentionLayer[…][<|"Key"->{key₁,key₂,…},"Value"->{value₁,value₂,…},"Query"->{query₁,query₂,…}|>] explicitly computes outputs for each of the key_i, value_i and query_i in a batch of inputs.
AttentionLayer[…][input,NetPort["AttentionWeights"]] can be used to access the softmax-normalized attention weights on some input.
When given a NumericArray in the input, the output will be a NumericArray.
NetExtract[…,"ScoringNet"] can be used to extract net from an AttentionLayer[net] object.
Options[AttentionLayer] gives the list of default options to construct the layer. Options[AttentionLayer[…]] gives the list of default options to evaluate the layer on some data.
Information[AttentionLayer[…]] gives a report about the layer.
Information[AttentionLayer[…],prop] gives the value of the property prop of AttentionLayer[…]. Possible properties are the same as for NetGraph.

Examples


Basic Examples (2)

Create an AttentionLayer:

Wolfram Language code: AttentionLayer[]

Create a randomly initialized AttentionLayer that takes a sequence of two-dimensional keys, three-dimensional values and a sequence of one-dimensional queries:

Wolfram Language code: attend = NetInitialize@AttentionLayer["Key" -> {"Varying", 2}, "Value" -> {"Varying", 3}, "Query" -> {"Varying", 1}]

Apply the layer to an input:

Wolfram Language code: attend[<|"Key" -> {{1, 2}, {3, 4}}, "Value" -> {{1, 2, 2}, {2, 1, 2}}, "Query" -> {{5}, {6}, {7}, {8}}|>]//MatrixForm

The layer threads across a batch of sequences of different lengths:

Wolfram Language code:

attend[<|"Key" -> {{{1, 2}, {3, 4}}, {{1, 2}, {3, 4}, {5, 6}}}, "Value" -> {{{1, 2, 2}, {2, 1, 2}}, {{1, 2, 2}, {2, 1, 2}, {2, 2, 1}}}, "Query" -> {{{5}, {6}, {7}}, {{5}, {6}, {7}, {8}}}|>]//Map[MatrixForm]

Scope (4)

Scoring Net (2)

Create an AttentionLayer using a "Dot" scoring net:

Wolfram Language code: attend = AttentionLayer["Dot"]

Extract the "Dot" scoring net:

Wolfram Language code: net = NetExtract[attend, "ScoringNet"]

Create a new AttentionLayer explicitly specifying the scoring net as a NetGraph object:

Wolfram Language code: attend2 = AttentionLayer[net]

Create a custom scoring net with trainable parameters:

Wolfram Language code:

net = NetGraph[{10, 10, ThreadingLayer[Tanh[#1 + #2]&], {}}, {NetPort["Input"] -> 1 -> 3, NetPort["Query"] -> 2 -> 3 -> 4}, "Input" -> 2, "Query" -> 3]

Create and initialize an AttentionLayer that makes use of the custom scoring net:

Wolfram Language code: attend = NetInitialize@AttentionLayer[net]

Apply the layer with a single query vector:

Wolfram Language code: attend[<|"Key" -> {{1, 2}, {3, 4}}, "Value" -> {{1, 2, 2}, {2, 1, 2}}, "Query" -> {0.1, 2.1, 1.2}|>]

Apply the layer with a sequence of queries:

Wolfram Language code: attend[<|"Key" -> {{1, 2}, {3, 4}}, "Value" -> {{1, 2, 2}, {2, 1, 2}}, "Query" -> {{0.1, 2.1, 1.2}, {4.2, 4, 3}}|>]

Attention Weights (2)

Create an AttentionLayer:

Wolfram Language code: attend = AttentionLayer["Dot"]

Compute attention weights on a given input:

Wolfram Language code:

input = <|"Key" -> {{-1, 0, 1}, {1, 1, 0}, {0, -1, 1}, {1, 1, -1}}, "Value" -> {{1, 2, 3}, {-2, 0, 2}, {3, 2, 0}, {3, 2, 1}}, "Query" -> {1, 2, 3}|>;

Wolfram Language code: attend[input, NetPort["AttentionWeights"]]

In this case, the weights correspond to:

Wolfram Language code: Function[N@Exp[#] / Total[Exp[#]]]@Dot[input["Key"], input["Query"]]

Compute both attention weights and outputs of the layer:

Wolfram Language code: attend[input, {NetPort["AttentionWeights"], NetPort["Output"]}]
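As a hedged consistency check, the layer output for a single query should match the weighted sum of the value rows using the extracted weights, up to the single-precision arithmetic used by the net. The snippet restates attend and input so that it is self-contained, and the tolerance 10^-4 is an arbitrary choice for this sketch.

```wolfram
attend = AttentionLayer["Dot"];
input = <|"Key" -> {{-1, 0, 1}, {1, 1, 0}, {0, -1, 1}, {1, 1, -1}},
   "Value" -> {{1, 2, 3}, {-2, 0, 2}, {3, 2, 0}, {3, 2, 1}},
   "Query" -> {1, 2, 3}|>;

(* Evaluate the two ports separately *)
w = attend[input, NetPort["AttentionWeights"]];
out = attend[input];

(* The output should agree with the weighted sum of the value rows *)
Max[Abs[out - w . input["Value"]]] < 10^-4
```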

Take a model based on AttentionLayer:

Wolfram Language code: net = NetModel["BERT Trained on BookCorpus and Wikipedia Data"]

This net contains several multi-head self-attention layers with 12 heads, for instance:

Wolfram Language code: NetExtract[net, {"encoder", 1, 1, "attention", 5}]

Extract the attention weights of this layer for a given input to the net:

Wolfram Language code: input = "The cat sat on the mat";

Wolfram Language code: weights = NumericArray@net[input, NetPort[{"encoder", 1, 1, "attention", 5, "AttentionWeights"}]]

Represent the weights of the first attention head as connection strengths between input tokens:

Wolfram Language code: tokens = {"[START]", Splice[StringSplit[input]], "[END]"}

Wolfram Language code:

head = 1;Graphics[Table[{Text[tokens[[i]], {-1, -i}], Text[tokens[[j]], {5, -j}], Opacity[weights[[i, head, j]]], Line[{{0, -i}, {4, -j}}]}, {i, Length[weights]}, {j, Last@Dimensions[weights]}]]

Options (7)

"Dropout" (1)

Define an AttentionLayer with dropout on the attention weights:

Wolfram Language code: attendDrop = AttentionLayer["Dot", "Dropout" -> 0.5]

Without training-specific behavior, the layer returns the same result as without dropout:

Wolfram Language code:

input = <|"Query" -> {{0}, {1}, {-2}}, "Key" -> {{1}, {0}, {-1}}, "Value" -> {{1, 2}, {3, 4}, {-5, -6}}|>;

Wolfram Language code: attendDrop[input]//MatrixForm

Wolfram Language code: AttentionLayer["Dot"][input]//MatrixForm

With NetEvaluationMode->"Train", the layer returns different results:

Wolfram Language code: attendDrop[input, NetEvaluationMode -> "Train"]//MatrixForm

Dropout is applied directly on attention weights:

Wolfram Language code: attendDrop[input, NetPort["AttentionWeights"], NetEvaluationMode -> "Train"]//MatrixForm

LearningRateMultipliers (1)

Make a scoring net with arbitrary weights:

Wolfram Language code: weights = {{2, -1}, {-1, 1}};

Wolfram Language code:

dot = NetGraph[{LinearLayer["Weights" -> weights, "Biases" -> None], DotLayer[]}, {NetPort["Input"] -> 1 -> 2, NetPort["Query"] -> 2}]

Use this scoring net in AttentionLayer, freezing its weights with the option LearningRateMultipliers:

Wolfram Language code: attend = AttentionLayer[dot, LearningRateMultipliers -> 0, "Value" -> {2, 2}, "Query" -> 2]

A zero learning rate multiplier will be used for the weights of the scoring net when training:

Wolfram Language code: Information[attend, "ArraysLearningRateMultipliers"]

Wolfram Language code: Information[NetChain[{attend, LinearLayer[3], SoftmaxLayer[]}], "ArraysLearningRateMultipliers"]

"Mask" (2)

Define an AttentionLayer with causal masking:

Wolfram Language code: attend = AttentionLayer["Dot", "Mask" -> "Causal"]

Apply the attention layer to query, key and value sequences of length five:

Wolfram Language code:

input = <|"Query" -> {{1}, {0}, {-1}, {-2}, {2}}, "Key" -> {{-2}, {-1}, {0}, {1}, {2}}, "Value" -> {{1, 2}, {3, 4}, {5, 6}, {7, 8}, {-9, -10}}|>;

Wolfram Language code: attend[input]//MatrixForm

The output at a given step depends only on the keys and the values up to this step. In particular, the first output vector is the first vector of values.
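The causal behavior can be reproduced by hand on the same data. This is a sketch under the assumption of a "Dot" scoring function, and causalAttention is a hypothetical helper defined only for this illustration.

```wolfram
keys = {{-2}, {-1}, {0}, {1}, {2}};
values = {{1, 2}, {3, 4}, {5, 6}, {7, 8}, {-9, -10}};
queries = {{1}, {0}, {-1}, {-2}, {2}};

(* At step t, only the first t keys and values take part in the softmax *)
causalAttention[k_, v_, q_] := Table[
   With[{s = k[[;; t]] . q[[t]]},
    (Exp[s]/Total[Exp[s]]) . v[[;; t]]],
   {t, Length[q]}];

result = N[causalAttention[keys, values, queries]];

(* Only position 1 is visible at the first step, so this is the first value vector *)
First[result]
```

At the second step the query is 0, so both visible scores are equal and the output is the plain average of the first two value vectors, {2., 3.}.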

The attention weights form a lower-triangular matrix:

Wolfram Language code: attend[input, NetPort["AttentionWeights"]]//MatrixForm

Define an AttentionLayer with local causal masking of window size 3:

Wolfram Language code: attend = AttentionLayer["Dot", "Mask" -> "Causal" -> 3]

Apply the attention layer to query, key and value sequences of length five:

Wolfram Language code:

input = <|"Query" -> {{1}, {0}, {-1}, {-2}, {2}}, "Key" -> {{-2}, {-1}, {0}, {1}, {2}}, "Value" -> {{1, 2}, {3, 4}, {5, 6}, {7, 8}, {-9, -10}}|>;

Wolfram Language code: attend[input]//MatrixForm

The output at a given step depends only on the keys and the values from the last three steps.

This can be seen in the matrix of attention weights that contains zeros:

Wolfram Language code: attend[input, NetPort["AttentionWeights"]]//MatrixForm

"MultiHead" (2)

Define an AttentionLayer with two heads:

Wolfram Language code: multihead = AttentionLayer["Dot", "MultiHead" -> True, "Value" -> {"Varying", 2, 4}, "Query" -> {2, 5}]

Apply multi-head attention on one query vector and a sequence of length three:

Wolfram Language code:

key = RandomReal[1, {3, 2, 5}];
value = RandomReal[1, {3, 2, 4}];
query = RandomReal[1, {2, 5}];

Wolfram Language code: multihead[<|"Key" -> key, "Value" -> value, "Query" -> query|>]//MatrixForm

The result is the same as applying single-head attention separately on each head and joining the result:

Wolfram Language code: singlehead = AttentionLayer["Dot", "Value" -> {"Varying", 4}, "Query" -> {5}]

Wolfram Language code: singlehead[<|"Key" -> Transpose[key, 1 <-> 2], "Value" -> Transpose[value, 1 <-> 2], "Query" -> query|>]//MatrixForm

Define a NetGraph to perform multi-head self-attention with six heads:

Wolfram Language code:

multiheadself = NetInitialize@NetGraph[<|"Key" -> NetMapOperator[{6, 10}], "Value" -> NetMapOperator[{6, 10}], "Query" -> NetMapOperator[{6, 10}], "Attention" -> AttentionLayer["Dot", "MultiHead" -> True], "Merge" -> NetMapOperator[10]|>, {"Key" -> NetPort["Attention", "Key"], "Value" -> NetPort["Attention", "Value"], "Query" -> NetPort["Attention", "Query"], 
	"Attention" -> "Merge"}, "Input" -> {"Varying", 10}]

Apply to a NumericArray with a sequence of length three:

Wolfram Language code: multiheadself[NumericArray[RandomReal[1, {3, 10}], "Real32"]]

"ScoreRescaling" (1)

Create an AttentionLayer that rescales attention scores with respect to the input dimension:

Wolfram Language code: attend = AttentionLayer["Dot", "ScoreRescaling" -> "DimensionSqrt"]

Evaluate the layer on an input:

Wolfram Language code:

input = <|"Key" -> RandomReal[{-1, 1}, {7, 20}], "Value" -> RandomReal[{-1, 1}, {7, 1}], "Query" -> RandomReal[1, {3, 20}]|>;

Wolfram Language code: attend[input]//MatrixForm

The output is less contrasted than without score rescaling:

Wolfram Language code: AttentionLayer["Dot"][input]//MatrixForm

The attention weights are also less contrasted, even if their ordering remains the same:

Wolfram Language code:

ListLinePlot[{AttentionLayer["Dot"][input, NetPort["AttentionWeights"]][[1, All]], attend[input, NetPort["AttentionWeights"]][[1, All]]}, PlotLegends -> {"without score rescaling", "with score rescaling"}]

Applications (1)

To sort lists of numbers, generate a test and training set consisting of lists of integers between 1 and 6:

Wolfram Language code:

digits = Range[6];
seqs = RandomSample[Flatten[Table[Tuples[digits, n], {n, 3, 6}], 1]];
data = Map[# -> Sort[#]&, seqs];
{testData, trainData} = TakeDrop[data, Ceiling[Length[data] / 10]];
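As a quick sanity check on the sizes involved, the number of sequences follows from counting all tuples of lengths 3 to 6 over six digits. The variable names below are illustrative.

```wolfram
(* Number of tuples of lengths 3..6 over 6 digits *)
total = Sum[6^n, {n, 3, 6}]        (* 216 + 1296 + 7776 + 46656 = 55944 *)

(* TakeDrop puts the first Ceiling[total/10] examples in the test set *)
testSize = Ceiling[total/10]       (* 5595 *)
trainSize = total - testSize       (* 50349 *)
```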

Display three random samples drawn from the training set:

Wolfram Language code: RandomSample[trainData, 3]

Define a NetGraph with an AttentionLayer:

Wolfram Language code:

net = NetGraph[<|
	"enc" -> {EmbeddingLayer[12], LongShortTermMemoryLayer[50]}, "dec" -> LongShortTermMemoryLayer[50], 
	"key" -> NetMapOperator[50], 
	"value" -> NetMapOperator[50], 
	"attend" -> AttentionLayer["Dot"], 
	"cat" -> CatenateLayer[2], "classify" -> {NetMapOperator[LinearLayer[]], SoftmaxLayer[]}
	|>, {
	"enc" -> "key" -> NetPort["attend", "Key"], "enc" -> "value" -> NetPort["attend", "Value"], 
	"enc" -> "dec" -> NetPort["attend", "Query"], 
	"dec" -> "cat", "attend" -> "cat", "cat" -> "classify"}, "Input" -> {"Varying", NetEncoder[{"Class", digits}]}, "Output" -> {"Varying", NetDecoder[{"Class", digits}]}
]

Train the net:

Wolfram Language code: trained = NetTrain[net, trainData, MaxTrainingRounds -> 3, ValidationSet -> testData]

Use the net to sort a list of integers:

Wolfram Language code: trained[{6, 6, 1, 4, 2}]

Properties & Relations (4)

If the query, key and value inputs are matrices, AttentionLayer[net] computes:

Wolfram Language code:

attentionFunction[inputKey_, inputValue_, query_, scorer_] := Table[Mean@WeightedData[inputValue, Exp[scorer[<|"Input" -> #, "Query" -> q|>]]& /@ inputKey], {q, query}];
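The definition above relies on the fact that the mean of data weighted by the exponentials of the scores equals a softmax-weighted sum. A small check with made-up numbers (vals and scores are illustrative):

```wolfram
vals = {{1, 2}, {3, 4}, {5, 6}};   (* illustrative value vectors *)
scores = {0.5, -1., 2.};           (* illustrative raw scores *)

(* Weighted mean with unnormalized weights Exp[score] *)
viaWeightedData = Mean[WeightedData[vals, Exp[scores]]];

(* Explicit softmax weights followed by a weighted sum *)
viaSoftmax = (Exp[scores]/Total[Exp[scores]]) . vals;

Max[Abs[viaWeightedData - viaSoftmax]] < 10^-10
```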

Define an AttentionLayer and extract the scoring subnet:

Wolfram Language code:

att = NetInitialize@AttentionLayer["Key" -> {"Varying", 3}, "Value" -> {"Varying", Automatic}, "Query" -> {"Varying", 2}]

Wolfram Language code: scorer = NetExtract[att, "ScoringNet"]

Evaluate AttentionLayer on some test data:

Wolfram Language code:

data = <|"Key" -> {{1, 2, 3}, {-0.4, -1, 1.5}}, "Value" -> {{1, -2, 0}, {-0.4, 1, 0}}, "Query" -> {{0, -3}, {2, 3}, {-0.2, 0}, {5, 2}}|>;

Wolfram Language code: att@data

This is equivalent to:

Wolfram Language code: attentionFunction[data["Key"], data["Value"], data["Query"], scorer]

AttentionLayer[net,"ScoreRescaling"->"DimensionSqrt"] computes:

Wolfram Language code:

attentionFunction[inputKey_, inputValue_, query_, scorer_] := Table[
	Mean@WeightedData[inputValue, Exp[scorer[<|"Input" -> #, "Query" -> q|>] / Sqrt[Last@Dimensions[inputKey]]]& /@ inputKey]
	, {q, query}];

Define an AttentionLayer and extract the scoring subnet:

Wolfram Language code:

att = NetInitialize@AttentionLayer["ScoreRescaling" -> "DimensionSqrt", "Key" -> {"Varying", 3}, "Value" -> {"Varying", Automatic}, "Query" -> {"Varying", 2}]

Wolfram Language code: scorer = NetExtract[att, "ScoringNet"]

Evaluate AttentionLayer on some test data:

Wolfram Language code:

data = <|"Key" -> RandomReal[{-1, 1}, {4, 3}], "Value" -> RandomReal[{-1, 1}, {4, 2}], "Query" -> RandomReal[{-1, 1}, {4, 2}]|>;

Wolfram Language code: att@data//MatrixForm

This is equivalent to:

Wolfram Language code: attentionFunction[data["Key"], data["Value"], data["Query"], scorer]//MatrixForm

AttentionLayer[scorer,"Mask"->"Causal","ScoreRescaling"->"DimensionSqrt"] computes:

Wolfram Language code:

attentionFunction[inputKey_, inputValue_, query_, scorer_] := Table[
	Mean@WeightedData[inputValue[[ ;; i]], Exp[scorer[<|"Input" -> #, "Query" -> query[[i]]|>] / Sqrt[Last@Dimensions[inputKey]]]& /@ inputKey[[ ;; i]]]
	, {i, Length[query]}];

Define an AttentionLayer and extract the scoring subnet:

Wolfram Language code:

att = NetInitialize@AttentionLayer["Mask" -> "Causal", "ScoreRescaling" -> "DimensionSqrt", "Key" -> {"Varying", 3}, "Value" -> {"Varying", Automatic}, "Query" -> {"Varying", 2}]

Wolfram Language code: scorer = NetExtract[att, "ScoringNet"]

Evaluate AttentionLayer on some test data:

Wolfram Language code:

data = <|"Key" -> RandomReal[{-1, 1}, {4, 3}], "Value" -> RandomReal[{-1, 1}, {4, 2}], "Query" -> RandomReal[{-1, 1}, {4, 2}]|>;

Wolfram Language code: att@data//MatrixForm

This is equivalent to:

Wolfram Language code: attentionFunction[data["Key"], data["Value"], data["Query"], scorer]//MatrixForm

If "Key" and "Value" inputs are the same, AttentionLayer is equivalent to the deprecated SequenceAttentionLayer:

Wolfram Language code:

NetGraph[<|"attend" -> AttentionLayer[]|>, {NetPort["Input"] -> {NetPort[{"attend", "Key"}], NetPort[{"attend", "Value"}]}}]

Which is equivalent to:

Wolfram Language code: NetGraph[<|"attend" -> SequenceAttentionLayer[]|>, {}]

Possible Issues (1)

When using the setting "Dot" for the scoring net net in AttentionLayer[net], the input key and query vectors cannot be different sizes:

Wolfram Language code: AttentionLayer["Dot", "Key" -> {"Varying", 3}, "Value" -> {"Varying", 4}, "Query" -> {"Varying", 4}]

Using the same size:

Wolfram Language code: AttentionLayer["Dot", "Key" -> {"Varying", 4}, "Value" -> {"Varying", 3}, "Query" -> {"Varying", 4}]

This restriction does not apply to using a "Bilinear" scoring net:

Wolfram Language code: AttentionLayer["Bilinear", "Key" -> {"Varying", 3}, "Value" -> {"Varying", 4}, "Query" -> {"Varying", 17}]
