"AudioMFCC" (Net Encoder)

NetEncoder["AudioMFCC"]

represents an encoder that converts an audio file or object into its mel-frequency cepstral coefficients.

NetEncoder[{"AudioMFCC","param"->val,…}]

represents an encoder with specific parameters for preprocessing and feature computation.

Details

The "AudioMFCC" encoder computes the FourierDCT of the logarithm of each frame of the mel-spectrogram and keeps only the first few coefficients. The resulting mel-frequency cepstral coefficients (MFCC) greatly reduce the dimensionality of the feature while preserving much of the information contained in the original signal, especially in the case of speech.
NetEncoder[…][input] applies the encoder to an input to produce a "Real32" output.
NetEncoder[…][{input₁,input₂,…}] applies the encoder to a list of inputs to produce a list of outputs.
When given a NumericArray as input, the output will be a NumericArray.
The input to the encoder can be an Audio object or a File[…] expression.
The output is computed by applying a discrete cosine transform to the mel-spectrogram, keeping only the first nc coefficients.
The output of the encoder is a rank-2 tensor of dimensions {n,nc}, where n is the number of partitions after the preprocessing is applied and nc is the number of coefficients used for the computation.
An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
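As a minimal sketch of attaching the encoder to a net, the following builds a small untrained chain whose "Input" port uses the encoder. The layer choices, their sizes and the synthetic tone are illustrative assumptions, not part of the encoder's specification.

```wolfram
(* Assumption: a synthetic one-second 440 Hz tone stands in for real audio *)
audio = AudioGenerator[{"Sin", 440}, 1];

enc = NetEncoder["AudioMFCC"];

(* Hypothetical toy net: the layers and their sizes are arbitrary choices *)
net = NetInitialize@NetChain[
    {LongShortTermMemoryLayer[8], SequenceLastLayer[], LinearLayer[2]},
    "Input" -> enc];

(* The net accepts an Audio object directly; the encoder runs first *)
out = net[audio];
Length[out]
```

Because the net is only randomly initialized, the output values themselves carry no meaning; the point is that the Audio object is converted to MFCC features before reaching the first layer.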

Parameters

The following general parameters are supported:

"Augmentation"	None	augmentation to be applied
"Normalization"	None	whether to apply normalization
"SampleRate"	16000	target sample rate
"TargetLength"	All	target output length

Additional partitioning parameters:

"WindowSize"	Automatic	length of the partitions
"Offset"	Automatic	offset of the partitions
"WindowFunction"	Automatic	window to be applied to the partitions

Mel-spectrogram parameters:

"MaximumFrequency"	Automatic	maximum frequency of the mel filters
"MinimumFrequency"	Automatic	minimum frequency of the mel filters
"NumberOfFilters"	40	number of the mel filters

MFCC parameter:

"NumberOfCoefficients"	13	number of coefficients

The following settings and suboptions can be specified for each encoder parameter.
"Normalization" can take the following settings:

	None	no normalization
	"Max"	absolute maximum value normalized to 1
	{"Max",val}	absolute maximum value normalized to val
	{"RMS",val}	RMS of input audio signal normalized to val

"TargetLength" can take the following settings:

	All	same as input signal
	dur	the duration dur specified as a time quantity
	n	the first n partitions

If the specified "TargetLength" does not match the length of the input signal, padding or trimming is applied accordingly.
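The effect of a numeric "TargetLength" can be checked on the output dimensions. This is a sketch under the assumption that a synthetic one-second tone is a suitable test input; the values 10 and 200 are arbitrary.

```wolfram
(* Assumption: a synthetic one-second 440 Hz tone stands in for real audio *)
audio = AudioGenerator[{"Sin", 440}, 1];

(* Fewer partitions than the signal provides: the output is trimmed *)
encShort = NetEncoder[{"AudioMFCC", "TargetLength" -> 10}];
dimsShort = Dimensions[encShort[audio]];

(* More partitions than the signal provides: the output is padded *)
encLong = NetEncoder[{"AudioMFCC", "TargetLength" -> 200}];
dimsLong = Dimensions[encLong[audio]];

{dimsShort, dimsLong}
```

In both cases the first dimension is fixed by "TargetLength", while the second is the default number of coefficients.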
"Augmentation" can be specified as a list of rules with the following keys:

"Convolution"	None	convolves an impulse response to the input
"Noise"	None	adds noise to the input
"TimeShift"	None	shifts the input by a specified amount
"Volume"	None	multiplies the input with a constant
"VTLP"	None	applies vocal tract length perturbation to the input

Any augmentation parameter that accepts a numeric value can also be specified as a list of two numbers or a univariate distribution. In the first case, the value will be randomized according to a uniform distribution between the given bounds. In the second, the user-provided distribution will be used.
Possible values for "Convolution" include:

	None	no augmentation
	signal	File or Audio object to be convolved with input
	{mix,signal}	signal to be convolved with input and mix parameter

Possible values for "Noise" include:

	None	no augmentation
	amp	white noise with amplitude amp
	noise	File or Audio object containing the noise signal to be added
	{amp,noise}	noise signal added with the specified amplitude

Use "TimeShift"->t to shift the input by t seconds, padding or trimming if necessary. Use Scaled[s] to shift the input by s×dur seconds, where dur is the duration of the input signal. Use {t₁,t₂} or Scaled[{s₁,s₂}] to randomize the shift between the specified bounds.
Use "Volume"->val to specify a constant multiplier.
Vocal tract length perturbation (VTLP) multiplies the center values of the filter frequencies in the mel-spectrogram by a fixed amount. Use "VTLP"->val to specify the amount of the perturbation.
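Several augmentation keys can be combined in one list of rules. The following is only a sketch: the numeric values are arbitrary assumptions, chosen to show a fixed setting, a uniform range and a Scaled range side by side.

```wolfram
(* All numeric values below are illustrative assumptions *)
enc = NetEncoder[{"AudioMFCC",
    "Augmentation" -> {
      "Volume" -> {0.5, 1.5},           (* multiplier drawn uniformly between 0.5 and 1.5 *)
      "Noise" -> 0.01,                  (* white noise with amplitude 0.01 *)
      "TimeShift" -> Scaled[{0, 0.1}]   (* random shift of up to 10% of the input duration *)
    }}];

Head[enc]
```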
With the parameter "WindowSize"->Automatic, a partition length of 25 milliseconds is used. Use "WindowSize"->dur to select a partition length of duration dur. Use "WindowSize"->n to select a partition length of n samples.
With the parameter "Offset"->Automatic, a partition offset of 8.33 milliseconds is used. Use "Offset"->dur to select a partition offset of duration dur. Use "Offset"->n to select a partition offset of n samples.
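Partition settings can be given either as durations or as sample counts. The sketch below assumes the default "SampleRate" of 16000, so that 25 milliseconds corresponds to 400 samples and 10 milliseconds to 160 samples; the 10 millisecond offset and the synthetic tone are illustrative choices.

```wolfram
(* Assumption: a synthetic one-second 440 Hz tone stands in for real audio *)
audio = AudioGenerator[{"Sin", 440}, 1];

sr = 16000;
windowSamples = sr*25/1000;   (* 25 ms expressed in samples at 16 kHz *)
offsetSamples = sr*10/1000;   (* 10 ms expressed in samples at 16 kHz *)

(* The same partitioning specified as time quantities ... *)
encDur = NetEncoder[{"AudioMFCC",
    "WindowSize" -> Quantity[25, "Milliseconds"],
    "Offset" -> Quantity[10, "Milliseconds"]}];

(* ... and as sample counts *)
encSamples = NetEncoder[{"AudioMFCC",
    "WindowSize" -> windowSamples,
    "Offset" -> offsetSamples}];

{Dimensions[encDur[audio]], Dimensions[encSamples[audio]]}
```

If the two specifications are equivalent, as assumed here, both encoders produce outputs with the same dimensions.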
Parameter "WindowFunction" applies a window to each partition. Possible settings are:

	None	no windowing applied to the input audio
	Automatic
	func	the window is computed using the function func
	list	the sampled window list is explicitly specified

With the parameter "MinimumFrequency"->Automatic, a frequency is computed as Ceiling[sr/ws], where sr is the sample rate "SampleRate" and ws is the partition length "WindowSize". Use "MinimumFrequency"->f to set the minimum frequency for the filters to f.
With the parameter "MaximumFrequency"->Automatic, a frequency is computed as Round[Min[8000,sr/2]], where sr is the sample rate "SampleRate". Use "MaximumFrequency"->f to set the maximum frequency for the filters to f.
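The two Automatic formulas can be evaluated by hand. In this sketch the helper names autoMinFrequency and autoMaxFrequency are hypothetical, and ws is assumed to be the partition length in samples.

```wolfram
(* Hypothetical helpers that mirror the formulas in the text *)
autoMinFrequency[sr_, ws_] := Ceiling[sr/ws]
autoMaxFrequency[sr_] := Round[Min[8000, sr/2]]

(* Default sample rate with a 25 ms window, assumed to be 400 samples *)
autoMinFrequency[16000, 400]   (* 40 *)
autoMaxFrequency[16000]        (* 8000 *)

(* At a lower sample rate the Nyquist frequency becomes the limit *)
autoMaxFrequency[8000]         (* 4000 *)
```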
With the parameter "NumberOfFilters"->n, n filters will be used in the computation of the MFCC.
With the parameter "NumberOfCoefficients"->n, n coefficients will be used in the computation of the MFCC.
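The number of coefficients determines the second dimension of the output. A minimal sketch, assuming a synthetic tone as input and arbitrary values of 26 filters and 20 coefficients:

```wolfram
(* Assumption: a synthetic one-second 440 Hz tone stands in for real audio *)
audio = AudioGenerator[{"Sin", 440}, 1];

(* 26 and 20 are arbitrary illustrative values *)
enc = NetEncoder[{"AudioMFCC",
    "NumberOfFilters" -> 26,
    "NumberOfCoefficients" -> 20}];

(* The last dimension equals the number of coefficients kept *)
nc = Last[Dimensions[enc[audio]]]
```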

Examples


Basic Examples (1)

Create an MFCC NetEncoder:

Create an Audio object:

Apply the encoder to the Audio object:

Plot the result:
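A minimal sketch of these four steps, assuming a synthetic tone as the Audio object (the input of the original example is not shown here):

```wolfram
(* Create the encoder with default parameters *)
enc = NetEncoder["AudioMFCC"];

(* Assumption: a synthetic one-second 440 Hz tone stands in for real audio *)
audio = AudioGenerator[{"Sin", 440}, 1];

(* Apply the encoder; rows are partitions, columns are coefficients *)
mfcc = enc[audio];

(* Plot with time running along the horizontal axis *)
MatrixPlot[Transpose[Normal[mfcc]]]
```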

Scope (3)

NetEncoder["AudioMFCC"] can encode either File or Audio objects. Create an MFCC encoder:

Apply the encoder to a File object:

Apply the encoder to an in-core Audio object:

Apply the encoder to an out-of-core Audio object:

Create a list of Audio objects:

NetEncoder["AudioMFCC"] maps across a batch of inputs:
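A sketch of batch encoding, assuming two synthetic tones of different durations as the list of inputs:

```wolfram
(* Assumption: synthetic tones of 0.5 s and 1 s stand in for real recordings *)
audios = {AudioGenerator[{"Sin", 220}, 0.5], AudioGenerator[{"Sin", 440}, 1]};

enc = NetEncoder["AudioMFCC"];

(* One output per input; each output has its own number of partitions *)
results = enc[audios];
Dimensions /@ results
```

With the default "TargetLength"->All, the number of partitions follows the duration of each input, so the outputs in a batch can differ in their first dimension.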

Create an MFCC NetEncoder:

Attach the encoder to the input of a net:

Apply the net to an Audio object:

Parameters (9)

"Normalization" (1)

Create an Audio object:

Use an encoder with "Normalization"->None to avoid any normalization:

Since the normalization is applied to the signal before the spectrogram is computed, there are no guarantees on the bounds of the result:

Use an encoder with "Normalization"->"Max" to normalize the maximum absolute value of the waveform samples to 1:

Find the minimum and maximum values of the result:

"SampleRate" (1)

Create an Audio object:

Using an encoder with "SampleRate"->8000 resamples the signal to 8000 Hz before performing the short-time Fourier transform:

"TargetLength" (1)

Create an Audio object:

Using an encoder with "TargetLength"->All returns the MFCC for all the data:

Using an encoder with "TargetLength"->10 zero-pads the output to be of length 10:

Using an encoder with "TargetLength"->2 takes only the first two partitions:

"WindowSize" (1)

The partition length is by default 25 ms:

"Offset" (1)

Create an Audio object:

The partition offset is automatically computed to be 1/3 of the partition length:

Using an encoder with "Offset"->10 returns the MFCC computed using partitions with an offset of 10 samples:

"MinimumFrequency" (1)

Create an Audio object:

The minimum frequency is automatically computed to be Ceiling[sr/ws], where sr is the sample rate "SampleRate" and ws is the partition length "WindowSize":

Using an encoder with "MinimumFrequency"->2000 returns the MFCC computed using filters whose minimum frequency is 2000 Hz:

"MaximumFrequency" (1)

Create an Audio object:

The maximum frequency is automatically computed to be Round[Min[8000,sr/2]], where sr is the sample rate "SampleRate":

Using an encoder with "MaximumFrequency"->2000 returns the MFCC computed using filters whose maximum frequency is 2000 Hz:

"NumberOfFilters" (1)

Create an Audio object:

By default, 40 filters are used for the computation of the MFCC:

Using an encoder with "NumberOfFilters"->14 returns the MFCC computed using 14 filters:

"NumberOfCoefficients" (1)

Create an Audio object:

By default, 13 coefficients are used for the computation of the MFCC:

Using an encoder with "NumberOfCoefficients"->40 returns the MFCC computed using 40 coefficients:

Properties & Relations (1)

Create an Audio object:

Create an MFCC NetEncoder:

The length of the result can be computed as Ceiling[length/offset], where length is the length of the signal after resampling and offset is the "Offset" parameter of the encoder:
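The relation can be checked numerically. This sketch assumes a synthetic tone generated directly at the encoder's sample rate, so that no resampling changes the signal length, and an arbitrary offset of 160 samples.

```wolfram
sr = 16000;     (* sample rate shared by the audio and the encoder *)
offset = 160;   (* arbitrary illustrative offset, in samples *)

(* Assumption: a synthetic one-second tone, i.e. sr samples in total *)
audio = AudioGenerator[{"Sin", 440}, 1, SampleRate -> sr];

enc = NetEncoder[{"AudioMFCC", "SampleRate" -> sr, "Offset" -> offset}];

(* Number of partitions predicted by the formula in the text *)
expected = Ceiling[sr/offset];

(* Compare the predicted count with the length of the encoder output *)
{expected, Length[enc[audio]]}
```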
