"Tokens" (Net Encoder)

NetEncoder["Tokens"]

represents an encoder that converts the words in a string to a sequence of integer codes using a standard English vocabulary.

NetEncoder[{"Tokens","language"}]

represents an encoder that uses a standard vocabulary for the given language.

NetEncoder[{"Tokens",{token1,token2,}}]

represents an encoder that uses a specified list of tokens as the vocabulary.

NetEncoder[{"Tokens",,"param"value}]

represents an encoder in which additional parameters have been specified.

Details

  • NetEncoder[…][input] applies the encoder to an input to produce an output.
  • NetEncoder[…][{input1,input2,…}] applies the encoder to a list of strings to produce a list of outputs.
  • The input to the encoder must be a string or a TextElement containing a list of strings representing tokens. If the input is a string, it is split into tokens according to the value of "SplitPattern".
  • The output of the encoder is a sequence of integers between 1 and d+1, where d is the number of tokens in the vocabulary. The integer d+1 is used to signify tokens in the input that do not occur in the dictionary.
  • The type of the output NumericArray is the smallest unsigned integer that can represent all possible output integer values.
  • An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
  • Parameters
  • The following parameters can be specified:
  • "IgnoreCase"Truewhether to ignore case when matching tokens from the string
    "SplitPattern"TemplateBox[{WordBoundary, paclet:ref/WordBoundary}, RefLink, BaseStyle -> {3ColumnTableMod}]the string pattern to use in order to split the input string into tokens
    "TargetLength"Allthe length of the final sequence to crop or pad to
  • With the parameter "IgnoreCase"->True, tokens are effectively converted to lowercase before encoding.
  • With the parameter "TargetLength"->All, all tokens found in the input string are encoded.
  • With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. The padding value is d+1, where d is the number of tokens in the vocabulary.
  • With the parameter "SplitPattern"->None, the input to the encoder is assumed to be a pre-tokenized list of strings of the form {"token1","token2",}.

Examples


Basic Examples  (1)

Create a token encoder for English text:
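A sketch of this step (the variable name enc is illustrative and is reused in the examples that follow):

enc = NetEncoder["Tokens"]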

Encode an English sentence:
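For example (the sentence is illustrative; the actual codes depend on the built-in English vocabulary and are not shown):

enc["The cat sat on the mat."]
(* gives a list of integer codes, one per detected token *)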

Out-of-vocabulary words are encoded as the maximum code:
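A sketch, using a made-up word:

enc["xyzzyqwfp"]
(* a word that is not in the vocabulary is encoded as d+1, the largest possible code *)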

By default, words are detected using a simple regular expression:
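One way to see this, assuming the default WordBoundary-based splitting described in the parameter table:

enc["it's a state-of-the-art model"]
(* words joined by apostrophes or hyphens are split into separate tokens, so this yields more codes than a plain whitespace split would *)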

The list of words can be explicitly passed using TextElement:
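A possible form of this call, supplying the tokens directly so that no splitting is performed:

enc[TextElement[{"The", "cat", "sat", "on", "the", "mat"}]]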

Scope  (6)

Use the default token encoder to encode a sentence:
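For instance (sentence illustrative; the codes depend on the built-in vocabulary):

enc = NetEncoder["Tokens"];
enc["This is an example sentence."]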

Give a specific list of tokens:
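A sketch with a small custom vocabulary (token indices are assumed to follow the order of the supplied list):

enc = NetEncoder[{"Tokens", {"the", "cat", "sat", "mat"}}]

enc["the cat sat on the mat"]
(* "on" is not in the vocabulary, so it is encoded as d+1 = 5 *)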

Give a specific list of tokens, including a split pattern:
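One possible version, splitting on commas rather than word boundaries (the choice of delimiter is an assumption):

enc = NetEncoder[{"Tokens", {"red", "green", "blue"}, "SplitPattern" -> ","}]

enc["red,blue,yellow"]
(* the input is split at commas; "yellow" is out of vocabulary *)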

Specify that the sequence should be padded or trimmed to be 4 elements long:
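A sketch, assuming the default English vocabulary:

enc = NetEncoder[{"Tokens", "TargetLength" -> 4}]

enc["one two three four five six"]  (* cropped to the first 4 codes *)

enc["one two"]  (* padded with the code d+1 up to length 4 *)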

Use a built-in dictionary for a specific language:
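For example, naming the language explicitly (English is also the default; other language names may be accepted, but are not shown here):

enc = NetEncoder[{"Tokens", "English"}]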

Use a custom tokenization with TextElement:
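One way this might look, splitting the string yourself and wrapping the result in TextElement:

enc = NetEncoder["Tokens"];
tokens = StringSplit["the quick brown fox", " "];
enc[TextElement[tokens]]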

Use the output of TextStructure to compute a list of token indices:

A tree structure gets flattened:
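A rough sketch of these two steps; whether the default output form of TextStructure is the one intended here is an assumption:

enc = NetEncoder["Tokens"];
enc[TextStructure["The cat sat on the mat."]]
(* the nested constituent structure is flattened into a single sequence of token codes *)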

Parameters  (3)

"IgnoreCase"  (1)

An encoder with "IgnoreCase"->True treats tokens that differ only by the case of their constituent characters as equivalent:

An encoder with "IgnoreCase"->False does not do this:

"SplitPattern"  (2)

Create an encoder that isolates digit characters, using "SplitPattern":

The encoder outputs one token for each digit character:

It is different from the default behavior, which gathers all consecutive digit characters together:
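A sketch covering the three steps above; the empty split pattern used to break the input between every character is an assumption, not necessarily the pattern used in the original example:

digitEnc = NetEncoder[{"Tokens", CharacterRange["0", "9"], "SplitPattern" -> ""}]

digitEnc["2718"]
(* one code per digit character *)

defaultEnc = NetEncoder[{"Tokens", CharacterRange["0", "9"]}]

defaultEnc["2718"]
(* with the default splitting, "2718" remains a single, out-of-vocabulary token *)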

Create an encoder with "SplitPattern"->None and two tokens:

The encoder now expects a list of tokens as input:

The encoder still maps across a batch of examples:
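A possible version of this example (token names illustrative):

enc = NetEncoder[{"Tokens", {"a", "b"}, "SplitPattern" -> None}]

enc[{"a", "b", "b", "c"}]
(* the input is already a list of tokens; "c" is out of vocabulary and maps to 3 *)

enc[{{"a", "b"}, {"b", "b", "a"}}]
(* a list of token lists is encoded as a batch *)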