Transformer Transducer Model
class openspeech.models.transformer_transducer.model.TransformerTransducerModel(configs: omegaconf.dictconfig.DictConfig, vocab: openspeech.vocabs.vocab.Vocabulary)[source]

In Transformer-Transducer, every layer is identical for both the audio and label encoders. Unlike the basic Transformer structure, the audio encoder and label encoder are separate, so alignment is handled by the separate forward-backward process of the RNN-T architecture, and the LSTM encoders of RNN-T are replaced with Transformer encoders.
- Parameters
    configs (DictConfig) – configuration set
    vocab (Vocabulary) – the vocabulary of the training data
- Inputs:
    inputs (torch.FloatTensor): An input sequence passed to the encoders. Typically a padded FloatTensor of size (batch, seq_length, dimension).
    input_lengths (torch.LongTensor): The lengths of the input sequences, of size (batch).
- Returns
Result of model predictions.
- Return type
y_hats (torch.FloatTensor)
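A minimal usage sketch, assuming openspeech is installed; the import paths and call signature come from this page, while the config and vocab construction are placeholders that depend on your training setup:

    import torch
    from omegaconf import OmegaConf

    from openspeech.models.transformer_transducer.model import TransformerTransducerModel
    from openspeech.models.transformer_transducer.configurations import TransformerTransducerConfigs

    # Placeholder config: real training configs typically nest this dataclass
    # alongside audio/trainer sections, so adapt to your setup.
    configs = OmegaConf.structured(TransformerTransducerConfigs())

    # Placeholder: an openspeech.vocabs.vocab.Vocabulary built from your dataset.
    vocab = ...

    model = TransformerTransducerModel(configs=configs, vocab=vocab)

    inputs = torch.randn(4, 1000, 80)                        # (batch, seq_length, dimension)
    input_lengths = torch.LongTensor([1000, 950, 900, 880])  # (batch)

    y_hats = model(inputs, input_lengths)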
decode(encoder_outputs: torch.Tensor, max_length: int) → torch.Tensor[source]

Decode encoder_outputs.
- Parameters
    encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (seq_length, dimension)
    max_length (int) – max decoding time step
- Returns
    Model predictions.
- Return type
    y_hats (torch.IntTensor)
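A sketch of calling decode directly, reusing the model instance from the sketch above; the 2-D (seq_length, dimension) shape is taken from the docstring, so verify it against the installed version before relying on it:

    import torch

    # Encoder states for a single utterance: (seq_length, dimension).
    encoder_outputs = torch.randn(128, 512)

    # Decodes up to max_length time steps; returns predicted token ids.
    y_hats = model.decode(encoder_outputs, max_length=128)
    print(y_hats.shape)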
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Dict[str, torch.Tensor][source]

Forward propagate inputs through the model.
- Parameters
    inputs (torch.FloatTensor) – An input sequence passed to the encoders. Typically a padded FloatTensor of size (batch, seq_length, dimension).
    input_lengths (torch.LongTensor) – The lengths of the input sequences, of size (batch).
- Returns
Result of model predictions.
- Return type
outputs (dict)
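Since forward returns a dictionary of tensors, a short sketch (reusing model from the instantiation sketch above) is to inspect the keys rather than assume their names:

    import torch

    inputs = torch.randn(2, 800, 80)            # (batch, seq_length, dimension)
    input_lengths = torch.LongTensor([800, 720])

    outputs = model(inputs, input_lengths)      # Dict[str, torch.Tensor]

    # Key names vary by version; inspect them instead of assuming.
    print(list(outputs.keys()))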
test_step(batch: tuple, batch_idx: int) → collections.OrderedDict[source]

Forward propagate an inputs and targets pair for testing.
- Inputs:
    batch (tuple): A batch containing inputs, targets, input_lengths, and target_lengths
    batch_idx (int): The index of the batch
- Returns
    Loss for the test step.
- Return type
    loss (torch.Tensor)
training_step(batch: tuple, batch_idx: int) → collections.OrderedDict[source]

Forward propagate an inputs and targets pair for training.
- Inputs:
    batch (tuple): A training batch containing inputs, targets, input_lengths, and target_lengths
    batch_idx (int): The index of the batch
- Returns
    Loss for training.
- Return type
    loss (torch.Tensor)
validation_step(batch: tuple, batch_idx: int) → collections.OrderedDict[source]

Forward propagate an inputs and targets pair for validation.
- Inputs:
    batch (tuple): A validation batch containing inputs, targets, input_lengths, and target_lengths
    batch_idx (int): The index of the batch
- Returns
    Loss for validation.
- Return type
    loss (torch.Tensor)
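These *_step methods follow the PyTorch Lightning hook convention, so they are usually driven by a Trainer; the sketch below calls training_step directly on an assumed batch tuple purely to show the expected batch layout (the shapes and the "loss" key are assumptions based on that convention, not confirmed by this page):

    import torch

    # Assumed shapes; targets are padded token-id sequences from the vocab.
    inputs = torch.randn(2, 800, 80)
    targets = torch.randint(0, 10, (2, 20), dtype=torch.long)
    input_lengths = torch.LongTensor([800, 720])
    target_lengths = torch.LongTensor([20, 18])

    batch = (inputs, targets, input_lengths, target_lengths)
    result = model.training_step(batch, batch_idx=0)

    # The "loss" key follows the Lightning convention; verify against the
    # returned OrderedDict.
    print(result["loss"])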
Transformer Transducer Model Configuration
class openspeech.models.transformer_transducer.configurations.TransformerTransducerConfigs(model_name: str = 'transformer_transducer', encoder_dim: int = 512, d_ff: int = 2048, num_audio_layers: int = 18, num_label_layers: int = 2, num_attention_heads: int = 8, audio_dropout_p: float = 0.1, label_dropout_p: float = 0.1, decoder_hidden_state_dim: int = 512, decoder_output_dim: int = 512, conv_kernel_size: int = 31, max_positional_length: int = 5000, optimizer: str = 'adam')[source]

This is the configuration class to store the configuration of a TransformerTransducer. It is used to initialize a TransformerTransducer model. Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.
- Configurations:
    model_name (str): Model name. (default: transformer_transducer)
    encoder_dim (int): Dimension of the encoder. (default: 512)
    d_ff (int): Dimension of the feed-forward network. (default: 2048)
    num_attention_heads (int): The number of attention heads. (default: 8)
    num_audio_layers (int): The number of audio encoder layers. (default: 18)
    num_label_layers (int): The number of label encoder layers. (default: 2)
    audio_dropout_p (float): The dropout probability of the audio encoder. (default: 0.1)
    label_dropout_p (float): The dropout probability of the label encoder. (default: 0.1)
    decoder_hidden_state_dim (int): Hidden state dimension of the decoder. (default: 512)
    decoder_output_dim (int): Dimension of the model output. (default: 512)
    conv_kernel_size (int): Kernel size of the convolution layer. (default: 31)
    max_positional_length (int): Max length of positional encoding. (default: 5000)
    optimizer (str): Optimizer for training. (default: adam)
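A short sketch of overriding defaults on the dataclass and converting it to the DictConfig form the model expects; field names follow the signature above:

    from omegaconf import OmegaConf

    from openspeech.models.transformer_transducer.configurations import TransformerTransducerConfigs

    configs = TransformerTransducerConfigs(num_audio_layers=12, audio_dropout_p=0.2)
    model_configs = OmegaConf.structured(configs)

    print(model_configs.encoder_dim)       # 512
    print(model_configs.num_audio_layers)  # 12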