Transformer Transducer Model

class openspeech.models.transformer_transducer.model.TransformerTransducerModel(configs: omegaconf.dictconfig.DictConfig, vocab: openspeech.vocabs.vocab.Vocabulary)[source]

In the Transformer-Transducer, every layer is identical for both the audio and label encoders. Unlike the basic Transformer structure, the audio encoder and label encoder are separate, so alignment is handled by the separate forward-backward process of the RNN-T architecture, and the LSTM encoders of the original RNN-T are replaced with Transformer encoders.

Parameters
  • configs (DictConfig) – configuration set

  • vocab (Vocabulary) – vocab of training data

Inputs:
inputs (torch.FloatTensor): An input sequence passed to the encoders. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

input_lengths (torch.LongTensor): The length of each input sequence. (batch)

Returns

Result of model predictions.

Return type

  • y_hats (torch.FloatTensor)
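
A minimal usage sketch (not part of the library docs): constructing the model from an already-composed configuration and running a padded batch through it. The configs and vocab values and the 80-dim feature size are assumptions that depend on your setup.

    import torch
    from omegaconf import DictConfig

    from openspeech.models.transformer_transducer.model import TransformerTransducerModel

    def predict(configs: DictConfig, vocab, inputs: torch.Tensor,
                input_lengths: torch.Tensor):
        # `configs` is assumed to be composed elsewhere (e.g. by Hydra) and
        # `vocab` to be an openspeech Vocabulary built from the training data.
        model = TransformerTransducerModel(configs, vocab)
        model.eval()
        with torch.no_grad():
            return model(inputs, input_lengths)

    # Padded batch: 3 utterances, up to 128 frames, 80-dim features (the
    # feature dimension is an assumption; it depends on the front-end):
    # predict(configs, vocab, torch.randn(3, 128, 80),
    #         torch.LongTensor([128, 100, 90]))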

decode(encoder_outputs: torch.Tensor, max_length: int) → torch.Tensor[source]

Decode encoder_outputs.

Parameters
  • encoder_outputs (torch.FloatTensor) – An output sequence from the encoders. FloatTensor of size (seq_length, dimension)

  • max_length (int) – max decoding time step

Returns

model’s predictions.

Return type

  • y_hats (torch.IntTensor)
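
A hedged sketch of how decode might be driven from precomputed encoder outputs for a single utterance; the model.encoder call and its return values are assumptions, not the documented API, so check the actual encoder before relying on this.

    import torch

    def greedy_decode(model, inputs: torch.Tensor, input_lengths: torch.Tensor,
                      max_length: int = 128) -> torch.Tensor:
        with torch.no_grad():
            # Assumed API: the audio encoder returns its outputs plus lengths.
            encoder_outputs, _ = model.encoder(inputs, input_lengths)
            # Per the docs above, decode expects encoder outputs of size
            # (seq_length, dimension) and a max decoding time step.
            y_hats = model.decode(encoder_outputs, max_length)
        return y_hats  # predicted token ids (torch.IntTensor per the docs)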

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Dict[str, torch.Tensor][source]

Forward propagate the inputs through the model and return its predictions.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoders. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

Returns

Result of model predictions.

Return type

  • outputs (dict)
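
A minimal sketch of a forward pass, assuming a constructed model; the exact dictionary keys are not documented here, so the example simply inspects whatever the model returns.

    import torch

    def inspect_forward(model) -> None:
        # The feature dimension (80) is an assumption; it must match the
        # acoustic front-end used to produce the inputs.
        inputs = torch.randn(2, 100, 80)
        input_lengths = torch.LongTensor([100, 85])

        outputs = model(inputs, input_lengths)

        # forward returns Dict[str, torch.Tensor]; print each key and the
        # shape of its tensor.
        for key, value in outputs.items():
            print(key, tuple(value.shape))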

test_step(batch: tuple, batch_idx: int) → collections.OrderedDict[source]

Forward propagate an inputs and targets pair for testing.

Inputs:

batch (tuple): A test batch containing inputs, targets, input_lengths, and target_lengths.

batch_idx (int): The index of the batch.

Returns

loss for the test batch

Return type

loss (torch.Tensor)

training_step(batch: tuple, batch_idx: int) → collections.OrderedDict[source]

Forward propagate an inputs and targets pair for training.

Inputs:

batch (tuple): A training batch containing inputs, targets, input_lengths, and target_lengths.

batch_idx (int): The index of the batch.

Returns

loss for training

Return type

loss (torch.Tensor)

validation_step(batch: tuple, batch_idx: int) → collections.OrderedDict[source]

Forward propagate an inputs and targets pair for validation.

Inputs:

batch (tuple): A validation batch containing inputs, targets, input_lengths, and target_lengths.

batch_idx (int): The index of the batch.

Returns

loss for the validation batch

Return type

loss (torch.Tensor)
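
training_step, validation_step, and test_step follow the PyTorch Lightning hook convention, so the model can be driven by a Lightning Trainer. A hedged sketch, where the data module is hypothetical:

    import pytorch_lightning as pl

    def run_experiment(model, data_module) -> None:
        # `model` is a TransformerTransducerModel; `data_module` is a
        # hypothetical LightningDataModule whose batches are
        # (inputs, targets, input_lengths, target_lengths) tuples.
        trainer = pl.Trainer(max_epochs=20)
        trainer.fit(model, datamodule=data_module)   # training_step / validation_step
        trainer.test(model, datamodule=data_module)  # test_step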

Transformer Transducer Model Configuration

class openspeech.models.transformer_transducer.configurations.TransformerTransducerConfigs(model_name: str = 'transformer_transducer', encoder_dim: int = 512, d_ff: int = 2048, num_audio_layers: int = 18, num_label_layers: int = 2, num_attention_heads: int = 8, audio_dropout_p: float = 0.1, label_dropout_p: float = 0.1, decoder_hidden_state_dim: int = 512, decoder_output_dim: int = 512, conv_kernel_size: int = 31, max_positional_length: int = 5000, optimizer: str = 'adam')[source]

This is the configuration class to store the configuration of a TransformerTransducer.

It is used to initialize a TransformerTransducer model.

Configuration objects inherit from :class:`~openspeech.dataclass.configs.OpenspeechDataclass`.

Configurations:

  • model_name (str): Model name. (default: transformer_transducer)

  • encoder_dim (int): Dimension of the encoder. (default: 512)

  • d_ff (int): Dimension of the feed-forward network. (default: 2048)

  • num_attention_heads (int): The number of attention heads. (default: 8)

  • num_audio_layers (int): The number of audio encoder layers. (default: 18)

  • num_label_layers (int): The number of label encoder layers. (default: 2)

  • audio_dropout_p (float): The dropout probability of the audio encoder. (default: 0.1)

  • label_dropout_p (float): The dropout probability of the label encoder. (default: 0.1)

  • decoder_hidden_state_dim (int): Hidden state dimension of the decoder. (default: 512)

  • decoder_output_dim (int): Dimension of the model output. (default: 512)

  • conv_kernel_size (int): Kernel size of the convolution layer. (default: 31)

  • max_positional_length (int): Max length of the positional encoding. (default: 5000)

  • optimizer (str): Optimizer for training. (default: adam)
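
A short sketch of building this configuration directly. In practice the config is normally composed by Hydra and nested inside a larger experiment config, so constructing the dataclass by hand is a simplification for illustration.

    from omegaconf import OmegaConf

    from openspeech.models.transformer_transducer.configurations import (
        TransformerTransducerConfigs,
    )

    # Build the defaults and override a couple of fields.
    model_configs = TransformerTransducerConfigs(
        num_audio_layers=12,  # shallower audio encoder than the default 18
        audio_dropout_p=0.2,
    )
    configs = OmegaConf.structured(model_configs)
    print(OmegaConf.to_yaml(configs))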