Data

Dataset

class openspeech.data.dataset.SpeechToTextDataset(configs: omegaconf.dictconfig.DictConfig, dataset_path: str, audio_paths: list, transcripts: list, sos_id: int = 1, eos_id: int = 2, del_silence: bool = False, apply_spec_augment: bool = False, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)[source]

Dataset for audio & transcript matching

Note

Do not use this class directly; use one of its subclasses.

Parameters
  • dataset_path (str) – path of the LibriSpeech dataset

  • audio_paths (list) – list of audio paths

  • transcripts (list) – list of transcripts

  • sos_id (int) – token ID of <|startofsentence|>

  • eos_id (int) – token ID of <|endofsentence|>

  • del_silence (bool) – flag indicating whether to delete silence

  • apply_spec_augment (bool) – flag indicating whether to apply SpecAugment

  • apply_noise_augment (bool) – flag indicating whether to apply noise augmentation

  • apply_time_stretch_augment (bool) – flag indicating whether to apply time-stretch augmentation

  • apply_joining_augment (bool) – flag indicating whether to apply audio-joining augmentation
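As a concrete illustration, here is a minimal pure-Python sketch (a hypothetical ToySpeechToTextDataset, not the actual OpenSpeech implementation) of how audio_paths, transcripts, sos_id, and eos_id fit together in a map-style dataset:

```python
# Illustrative sketch only: a minimal map-style dataset mirroring the
# SpeechToTextDataset interface. The real class loads and transforms
# audio; here we just return the path alongside the token sequence.
class ToySpeechToTextDataset:
    def __init__(self, audio_paths, transcripts, sos_id=1, eos_id=2):
        assert len(audio_paths) == len(transcripts)
        self.audio_paths = audio_paths
        self.transcripts = transcripts
        self.sos_id = sos_id
        self.eos_id = eos_id

    def __len__(self):
        return len(self.audio_paths)

    def __getitem__(self, idx):
        # Transcripts (here, space-separated token IDs) are wrapped with
        # the start/end-of-sentence token IDs.
        tokens = [int(t) for t in self.transcripts[idx].split()]
        return self.audio_paths[idx], [self.sos_id] + tokens + [self.eos_id]

dataset = ToySpeechToTextDataset(["a.wav", "b.wav"], ["5 6 7", "8 9"])
path, tokens = dataset[0]
print(path, tokens)  # a.wav [1, 5, 6, 7, 2]
```

A real subclass would additionally open the audio file, optionally delete silence, and apply the configured augmentations before returning a feature tensor.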

Data Loader

class openspeech.data.data_loader.AudioDataLoader(dataset: torch.utils.data.dataset.Dataset, num_workers: int, batch_sampler: torch.utils.data.sampler.Sampler, **kwargs)[source]

Audio Data Loader

Parameters
  • dataset (torch.utils.data.Dataset) – dataset from which to load the data.

  • num_workers (int) – how many subprocesses to use for data loading.

  • batch_sampler (torch.utils.data.sampler.Sampler) – defines the strategy to draw samples from the dataset.

class openspeech.data.data_loader.BucketingSampler(data_source, batch_size: int = 32, drop_last: bool = False)[source]

Samples batches assuming they are in order of size to batch similarly sized samples together.

Parameters
  • data_source (torch.utils.data.Dataset) – dataset to sample from

  • batch_size (int) – size of batch

  • drop_last (bool) – flag indicating whether to drop the last incomplete batch
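The bucketing idea can be sketched in a few lines of plain Python (an illustrative bucket_batches helper, not the actual BucketingSampler code): sort sample indices by length, then slice consecutive runs into batches so each batch holds similarly sized samples and padding is minimized:

```python
# Illustrative sketch of length-based bucketing (assumption: "size" means
# audio length, e.g. in frames; not the actual OpenSpeech implementation).
def bucket_batches(lengths, batch_size, drop_last=False):
    # Sort indices so neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Slice consecutive runs of indices into batches.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

lengths = [120, 30, 95, 40, 33]  # hypothetical audio lengths in frames
print(bucket_batches(lengths, batch_size=2))
# [[1, 4], [3, 2], [0]]
```

Each inner list could then be handed to a data loader as one batch of indices, which is the role batch_sampler plays in AudioDataLoader above.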

Spectrogram Feature Transform

class openspeech.data.audio.spectrogram.spectrogram.SpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]

Creates a spectrogram from an audio signal.

Configurations:

  • name (str): name of the feature transform (default: spectrogram)

  • sample_rate (int): sampling rate of the audio (default: 16000)

  • frame_length (float): frame length for the spectrogram (default: 20.0)

  • frame_shift (float): length of hop between STFT windows (default: 10.0)

  • del_silence (bool): flag indicating whether to delete silence (default: False)

  • num_mels (int): number of frequency bins to retain (default: 161)

Parameters

configs (DictConfig) – configuration set

Returns

A spectrogram feature. The shape is (seq_length, num_mels)

Return type

Tensor
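Assuming frame_length and frame_shift are given in milliseconds (the usual convention for defaults of 20.0 and 10.0), the STFT window, hop, and output length follow from simple arithmetic. The sketch below is illustrative, not OpenSpeech code; note that a 320-sample window yields 320 // 2 + 1 = 161 frequency bins, matching the default num_mels of 161:

```python
# Illustrative arithmetic: map millisecond-based defaults to sample-based
# STFT parameters and the resulting number of frames.
def stft_shape(num_samples, sample_rate=16000, frame_length=20.0, frame_shift=10.0):
    win = int(sample_rate * frame_length / 1000)   # 20 ms -> 320-sample window
    hop = int(sample_rate * frame_shift / 1000)    # 10 ms -> 160-sample hop
    # Number of full windows that fit (no padding assumed).
    num_frames = 1 + (num_samples - win) // hop if num_samples >= win else 0
    return win, hop, num_frames

win, hop, frames = stft_shape(16000)  # 1 second of 16 kHz audio
print(win, hop, frames)  # 320 160 99
```

So one second of audio produces roughly 99 frames of shape (seq_length, num_mels) = (99, 161) under these assumptions.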

Spectrogram Feature Transform Configuration

class openspeech.data.audio.spectrogram.configuration.SpectrogramConfigs(name: str = 'spectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 161)[source]

This is the configuration class to store the configuration of a SpectrogramFeatureTransform.

It is used to instantiate a SpectrogramFeatureTransform feature transform.

Configuration objects inherit from :class:`~openspeech.dataclass.OpenspeechDataclass`.

Configurations:

  • name (str): name of the feature transform (default: spectrogram)

  • sample_rate (int): sampling rate of the audio (default: 16000)

  • frame_length (float): frame length for the spectrogram (default: 20.0)

  • frame_shift (float): length of hop between STFT windows (default: 10.0)

  • del_silence (bool): flag indicating whether to delete silence (default: False)

  • num_mels (int): number of frequency bins to retain (default: 161)
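The shape of such a configuration object can be sketched with the standard-library dataclasses module (a hypothetical ToySpectrogramConfigs with the same fields and defaults; the real class inherits from OpenspeechDataclass and is consumed as an omegaconf DictConfig):

```python
from dataclasses import dataclass

# Illustrative sketch mirroring the fields and defaults documented for
# SpectrogramConfigs (not the actual OpenSpeech class).
@dataclass
class ToySpectrogramConfigs:
    name: str = "spectrogram"
    sample_rate: int = 16000
    frame_length: float = 20.0
    frame_shift: float = 10.0
    del_silence: bool = False
    num_mels: int = 161

# Any field can be overridden at construction time.
configs = ToySpectrogramConfigs(frame_shift=8.0)
print(configs.sample_rate, configs.frame_shift)  # 16000 8.0
```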

Mel-Spectrogram Feature Transform

class openspeech.data.audio.melspectrogram.melspectrogram.MelSpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]

Creates a mel-spectrogram from a raw audio signal. This is a composition of Spectrogram and MelScale.

Configurations:

  • name (str): name of the feature transform (default: melspectrogram)

  • sample_rate (int): sampling rate of the audio (default: 16000)

  • frame_length (float): frame length for the spectrogram (default: 20.0)

  • frame_shift (float): length of hop between STFT windows (default: 10.0)

  • del_silence (bool): flag indicating whether to delete silence (default: False)

  • num_mels (int): number of mel filterbank channels to retain (default: 80)

Parameters

configs (DictConfig) – configuration set

Returns

A mel-spectrogram feature. The shape is (seq_length, num_mels)

Return type

Tensor
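The mel scale that MelScale applies can be illustrated with the common HTK-style formula (shown for intuition under that assumption, not taken from the OpenSpeech source): frequencies are warped onto a perceptual scale before the num_mels filterbank channels are placed evenly along it.

```python
import math

# Illustrative HTK-style mel-scale conversion (an assumption for this
# sketch): higher frequencies are increasingly compressed.
def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

print(hz_to_mel(1000.0))                  # ~1000 mel by construction
print(mel_to_hz(hz_to_mel(4000.0)))       # round-trips back to ~4000.0 Hz
```

Placing 80 channels evenly between hz_to_mel(0) and hz_to_mel(sample_rate / 2) gives filterbanks that are narrow at low frequencies and wide at high ones, which is why the mel-spectrogram needs only num_mels = 80 channels versus the spectrogram's 161 linear bins.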

Mel-Spectrogram Feature Transform Configuration

class openspeech.data.audio.melspectrogram.configuration.MelSpectrogramConfigs(name: str = 'melspectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80)[source]

This is the configuration class to store the configuration of a MelSpectrogramFeatureTransform.

It is used to instantiate a MelSpectrogramFeatureTransform feature transform.

Configuration objects inherit from :class:`~openspeech.dataclass.OpenspeechDataclass`.

Configurations:

  • name (str): name of the feature transform (default: melspectrogram)

  • sample_rate (int): sampling rate of the audio (default: 16000)

  • frame_length (float): frame length for the spectrogram (default: 20.0)

  • frame_shift (float): length of hop between STFT windows (default: 10.0)

  • del_silence (bool): flag indicating whether to delete silence (default: False)

  • num_mels (int): number of mel filterbank channels to retain (default: 80)