Audio

Load Audio

openspeech.data.audio.load.load_audio(audio_path: str, sample_rate: int, del_silence: bool = False) → numpy.ndarray[source]

Loads an audio file (PCM) into a sound array. If del_silence is True, all sound below 30 dB is removed. If an exception occurs in numpy.memmap(), returns None.
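The loader's behavior can be sketched in plain numpy. This is a hedged illustration, not the library's implementation: the helper name load_pcm, the 10 ms framing, and the peak-relative energy threshold are assumptions; only the PCM memmap read, the None-on-exception contract, and the 30 dB silence cut come from the description above.

```python
import os
import tempfile

import numpy as np


def load_pcm(audio_path, sample_rate, del_silence=False):
    """Minimal sketch of load_audio: reads 16-bit signed PCM and optionally
    drops 10 ms frames whose energy is more than 30 dB below the peak frame."""
    try:
        signal = np.memmap(audio_path, dtype="h", mode="r").astype("float32")
    except Exception:
        return None
    signal = signal / 32767.0  # scale int16 samples into [-1, 1]
    if del_silence:
        frame = sample_rate // 100  # 10 ms frames (assumption)
        n = len(signal) // frame
        frames = signal[: n * frame].reshape(n, frame)
        rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-10
        db = 20.0 * np.log10(rms / (rms.max() + 1e-10))
        signal = frames[db > -30.0].reshape(-1)
    return signal


# Write a short synthetic PCM file and load it back.
pcm = (np.sin(2 * np.pi * 440 * np.arange(1600) / 16000) * 32767).astype("h")
with tempfile.NamedTemporaryFile(suffix=".pcm", delete=False) as f:
    f.write(pcm.tobytes())
    path = f.name
loaded = load_pcm(path, 16000)
os.unlink(path)
print(loaded.shape)  # (1600,)
```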

Spectrogram Feature Transform

class openspeech.data.audio.spectrogram.spectrogram.SpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]

Creates a spectrogram from an audio signal.

Configurations:

  • name (str): name of the feature transform (default: spectrogram)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of frequency bins to retain (default: 161)

Parameters

configs (DictConfig) – configuration set

Returns

A spectrogram feature. The shape is (seq_length, num_mels)

Return type

Tensor
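The default num_mels of 161 for the plain spectrogram is really the number of FFT bins, which follows from the default frame length: 20 ms at 16 kHz is 320 samples, and a 320-point FFT yields 320 // 2 + 1 = 161 one-sided bins. A minimal numpy sketch of that framing (the Hamming window is an assumption; the transform may window differently):

```python
import numpy as np

sample_rate, frame_length, frame_shift = 16000, 20.0, 10.0
n_fft = int(sample_rate * frame_length / 1000)   # 320 samples per frame
hop = int(sample_rate * frame_shift / 1000)      # 160-sample hop

signal = np.random.randn(16000).astype("float32")  # 1 s of dummy audio
n_frames = 1 + (len(signal) - n_fft) // hop
frames = np.stack([signal[i * hop: i * hop + n_fft] for i in range(n_frames)])
window = np.hamming(n_fft)                       # window choice is an assumption
spec = np.abs(np.fft.rfft(frames * window, n=n_fft))  # magnitude spectrogram

print(spec.shape)  # (seq_length, 161): 320 // 2 + 1 = 161 one-sided bins
```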

Spectrogram Feature Transform Configuration

class openspeech.data.audio.spectrogram.configuration.SpectrogramConfigs(name: str = 'spectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 161)[source]

This is the configuration class to store the configuration of a SpectrogramFeatureTransform.

It is used to instantiate a SpectrogramFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Configurations:

  • name (str): name of the feature transform (default: spectrogram)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of frequency bins to retain (default: 161)

Mel-Spectrogram Feature Transform

class openspeech.data.audio.melspectrogram.melspectrogram.MelSpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]

Creates a mel-spectrogram from a raw audio signal. This is a composition of Spectrogram and MelScale.

Configurations:

  • name (str): name of the feature transform (default: melspectrogram)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of mel filterbanks (default: 80)

Parameters

configs (DictConfig) – configuration set

Returns

A mel-spectrogram feature. The shape is (seq_length, num_mels)

Return type

Tensor
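The MelScale half of the composition applies a bank of triangular mel filters to the spectrogram. A hand-rolled numpy sketch of that filterbank (the HTK-style mel formula used here is an assumption about the exact scale; treat this as an illustration, not the library's kernel):

```python
import numpy as np


def mel_filterbank(num_mels, n_fft, sample_rate):
    """Triangular mel filters mapping linear FFT bins to mel bins."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    # num_mels + 2 equally spaced points on the mel scale, mapped to FFT bins.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), num_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((num_mels, n_fft // 2 + 1))
    for m in range(1, num_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):           # rising slope of triangle m
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope of triangle m
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb


power_spec = np.random.rand(99, 161)            # (seq_length, n_fft // 2 + 1)
fb = mel_filterbank(80, 320, 16000)
mel_spec = power_spec @ fb.T                    # (seq_length, num_mels)
print(mel_spec.shape)  # (99, 80)
```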

Mel-Spectrogram Feature Transform Configuration

class openspeech.data.audio.melspectrogram.configuration.MelSpectrogramConfigs(name: str = 'melspectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80)[source]

This is the configuration class to store the configuration of a MelSpectrogramFeatureTransform.

It is used to instantiate a MelSpectrogramFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Configurations:

  • name (str): name of the feature transform (default: melspectrogram)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of mel filterbanks (default: 80)

Filter-Bank Feature Transform

class openspeech.data.audio.filter_bank.filter_bank.FilterBankFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]

Creates a filter-bank (fbank) feature from a raw audio signal. This matches the input/output of Kaldi’s compute-fbank-feats.

Configurations:

  • name (str): name of the feature transform (default: fbank)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of mel filterbanks (default: 80)

Parameters

configs (DictConfig) – Hydra configuration set

Inputs:

signal (np.ndarray): signal from audio file.

Returns

An fbank identical to what Kaldi would output. The shape is (seq_length, num_mels)

Return type

Tensor
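fbank features are essentially log mel filterbank energies. The sketch below applies the log-compression step and checks the frame count for one second of 16 kHz audio under the default 20 ms / 10 ms framing; the "snip edges" frame-count formula is an assumption about Kaldi-compatible behavior:

```python
import numpy as np

n, n_fft, hop = 16000, 320, 160                # 1 s at 16 kHz, 20 ms / 10 ms
seq_length = 1 + (n - n_fft) // hop            # frame count, snip-edges style
mel_energies = np.random.rand(seq_length, 80) + 1e-6  # dummy mel energies
fbank = np.log(mel_energies)                   # log compression -> fbank
print(fbank.shape)  # (seq_length, num_mels)
```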

Filter-Bank Feature Transform Configuration

class openspeech.data.audio.filter_bank.configuration.FilterBankConfigs(name: str = 'fbank', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80)[source]

This is the configuration class to store the configuration of a FilterBankFeatureTransform.

It is used to instantiate a FilterBankFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Configurations:

  • name (str): name of the feature transform (default: fbank)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of mel filterbanks (default: 80)

MFCC Feature Transform

class openspeech.data.audio.mfcc.mfcc.MFCCFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]

Creates the Mel-frequency cepstral coefficients from an audio signal.

By default, this calculates the MFCC on the dB-scaled mel spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.

This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.

Configurations:

  • name (str): name of the feature transform (default: mfcc)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of MFC coefficients to retain (default: 40)

Parameters

configs (DictConfig) – configuration set

Returns

An MFCC feature. The shape is (seq_length, num_mels)

Return type

Tensor
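The "MFCC on the dB-scaled mel spectrogram" step amounts to a DCT-II over the mel axis, keeping the first num_mels coefficients. A numpy sketch of the orthonormal DCT-II (the orthonormal scaling matches librosa's norm='ortho' convention; treat it as an illustration, not the exact kernel used by this transform):

```python
import numpy as np


def dct_ii(x, num_ceps):
    """Orthonormal DCT-II along the last axis, keeping num_ceps coefficients."""
    n = x.shape[-1]
    k = np.arange(n)
    # basis[i, j] = cos(pi / n * (i + 0.5) * j), for output coefficient j
    basis = np.cos(np.pi / n * (k[:, None] + 0.5) * k[None, :num_ceps])
    basis *= np.sqrt(2.0 / n)
    basis[:, 0] = np.sqrt(1.0 / n)  # DC term gets its own orthonormal scale
    return x @ basis


log_mel = np.log(np.random.rand(99, 80) + 1e-6)  # dB/log-scaled mel spectrogram
mfcc = dct_ii(log_mel, 40)                       # keep 40 cepstral coefficients
print(mfcc.shape)  # (99, 40)
```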

MFCC Feature Transform Configuration

class openspeech.data.audio.mfcc.configuration.MFCCConfigs(name: str = 'mfcc', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 40)[source]

This is the configuration class to store the configuration of a MFCCFeatureTransform.

It is used to instantiate a MFCCFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Configurations:

  • name (str): name of the feature transform (default: mfcc)
  • sample_rate (int): sampling rate of the audio (default: 16000)
  • frame_length (float): frame length for the spectrogram, in milliseconds (default: 20.0)
  • frame_shift (float): length of the hop between STFT windows, in milliseconds (default: 10.0)
  • del_silence (bool): flag indicating whether to delete silence (default: False)
  • num_mels (int): number of MFC coefficients to retain (default: 40)

Augment

class openspeech.data.audio.augment.NoiseInjector(noise_dataset_dir: str, sample_rate: int = 16000, noise_level: float = 0.7)[source]

Provides noise injection for noise augmentation.

The noise augmentation process is as follows:

  1. Randomly sample audios by noise_size from the dataset.
  2. Extract noise from audio_paths.
  3. Add the noise to the sound.

Parameters
  • noise_dataset_dir (str) – path of noise dataset

  • sample_rate (int) – sampling rate

  • noise_level (float) – level of noise

Inputs: signal
  • signal: signal from audio file

Returns: signal
  • signal: noise added signal
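The steps above can be sketched with numpy. The helper name inject_noise, the random overlay position, and the crop-to-fit policy are assumptions; only the noise_level scaling of the added noise comes from the parameters above.

```python
import numpy as np


def inject_noise(signal, noise, noise_level=0.7):
    """Sketch of noise injection: overlay a noise clip, scaled by noise_level,
    at a random position in the signal (noise cropped to fit if too long)."""
    noise = noise[: len(signal)]
    start = np.random.randint(0, len(signal) - len(noise) + 1)
    out = signal.copy()
    out[start: start + len(noise)] += noise_level * noise
    return out


signal = np.zeros(16000, dtype="float32")        # 1 s of silence
noise = np.random.randn(8000).astype("float32")  # 0.5 s of extracted noise
noisy = inject_noise(signal, noise, noise_level=0.5)
print(noisy.shape)  # (16000,)
```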

class openspeech.data.audio.augment.SpecAugment(freq_mask_para: int = 18, time_mask_num: int = 10, freq_mask_num: int = 2)[source]

Provides SpecAugment, a simple data augmentation method for speech recognition. This concept was proposed in https://arxiv.org/abs/1904.08779

Parameters
  • freq_mask_para (int) – maximum frequency masking length

  • time_mask_num (int) – how many times to apply time masking

  • freq_mask_num (int) – how many times to apply frequency masking

Inputs: feature_vector
  • feature_vector (torch.FloatTensor): feature vector from audio file.

Returns: feature_vector
  • feature_vector: masked feature vector.
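The masking scheme can be sketched directly in numpy: zero out freq_mask_num random frequency bands of width up to freq_mask_para, and time_mask_num random time spans. The cap on the time-mask width below is an assumption; the class may derive it differently.

```python
import numpy as np


def spec_augment(feature, freq_mask_para=18, time_mask_num=10, freq_mask_num=2):
    """Sketch of SpecAugment masking on a (time, freq) feature matrix."""
    feature = feature.copy()
    n_time, n_freq = feature.shape
    for _ in range(freq_mask_num):               # frequency masking
        f = np.random.randint(0, freq_mask_para + 1)
        f0 = np.random.randint(0, max(n_freq - f, 1))
        feature[:, f0: f0 + f] = 0.0
    for _ in range(time_mask_num):               # time masking
        t = np.random.randint(0, n_time // 20 + 1)  # width cap is an assumption
        t0 = np.random.randint(0, max(n_time - t, 1))
        feature[t0: t0 + t, :] = 0.0
    return feature


feat = np.ones((99, 80), dtype="float32")
masked = spec_augment(feat)
print(masked.shape)  # (99, 80)
```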

class openspeech.data.audio.augment.TimeStretchAugment(min_rate: float = 0.7, max_rate: float = 1.4)[source]

Time-stretch an audio series by a fixed rate.

Inputs:

signal: np.ndarray [shape=(n,)] audio time series

Returns

np.ndarray [shape=(round(n/rate),)] audio time series stretched by the specified rate

Return type

y_stretch
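The only contract stated here is the output length, round(n / rate). The sketch below demonstrates that contract with plain interpolation; note that real time-stretching (e.g. a phase vocoder, as in librosa.effects.time_stretch) preserves pitch, while simple interpolation does not.

```python
import numpy as np


def stretch_by_resample(signal, rate):
    """Illustrates the length contract only: output has round(n / rate)
    samples. This is NOT pitch-preserving time-stretching."""
    n_out = int(round(len(signal) / rate))
    x_old = np.linspace(0.0, 1.0, num=len(signal))
    x_new = np.linspace(0.0, 1.0, num=n_out)
    return np.interp(x_new, x_old, signal)


signal = np.random.randn(16000)
rate = float(np.random.uniform(0.7, 1.4))  # default min_rate / max_rate range
y_stretch = stretch_by_resample(signal, rate)
print(len(y_stretch) == int(round(len(signal) / rate)))  # True
```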