Audio¶
Load Audio¶
Spectrogram Feature Transform¶
-
class openspeech.data.audio.spectrogram.spectrogram.SpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]¶
Create a spectrogram from an audio signal.
- Configurations:
name (str): name of the feature transform. (default: spectrogram)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of frequency bins to retain. (default: 161)
- Parameters
configs (DictConfig) – configuration set
- Returns
A spectrogram feature with shape (seq_length, num_mels).
- Return type
Tensor
Spectrogram Feature Transform Configuration¶
-
class openspeech.data.audio.spectrogram.configuration.SpectrogramConfigs(name: str = 'spectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 161)[source]¶
This is the configuration class to store the configuration of a SpectrogramFeatureTransform. It is used to initialize a SpectrogramFeatureTransform feature transform.
Configuration objects inherit from :class:`~openspeech.dataclass.OpenspeechDataclass`.
- Configurations:
name (str): name of the feature transform. (default: spectrogram)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of frequency bins to retain. (default: 161)
Mel-Spectrogram Feature Transform¶
-
class openspeech.data.audio.melspectrogram.melspectrogram.MelSpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]¶
Create a mel-spectrogram from a raw audio signal. This is a composition of Spectrogram and MelScale.
- Configurations:
name (str): name of the feature transform. (default: melspectrogram)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of mel filterbanks to retain. (default: 80)
- Parameters
configs (DictConfig) – configuration set
- Returns
A mel-spectrogram feature with shape (seq_length, num_mels).
- Return type
Tensor
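The MelScale step is what distinguishes this transform from the plain spectrogram above: linear frequency bins are pooled through num_mels triangular mel filters. A hedged NumPy sketch of that mel filterbank (illustrative arithmetic, not the library's code):

```python
import numpy as np

# Standard HTK-style mel conversions used to place triangular filters.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

sample_rate, n_fft, num_mels = 16000, 320, 80
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), num_mels + 2)
bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

# Build num_mels triangular filters over the n_fft // 2 + 1 linear bins.
filterbank = np.zeros((num_mels, n_fft // 2 + 1))
for m in range(1, num_mels + 1):
    left, center, right = bin_edges[m - 1], bin_edges[m], bin_edges[m + 1]
    for k in range(left, center):
        filterbank[m - 1, k] = (k - left) / max(center - left, 1)
    for k in range(center, right):
        filterbank[m - 1, k] = (right - k) / max(right - center, 1)

# Applying it to a (seq_length, 161) spectrogram yields (seq_length, 80).
spectrogram = np.random.rand(98, 161)
mel_spectrogram = spectrogram @ filterbank.T
print(mel_spectrogram.shape)  # (98, 80)
```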
Mel-Spectrogram Feature Transform Configuration¶
-
class openspeech.data.audio.melspectrogram.configuration.MelSpectrogramConfigs(name: str = 'melspectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80)[source]¶
This is the configuration class to store the configuration of a MelSpectrogramFeatureTransform. It is used to initialize a MelSpectrogramFeatureTransform feature transform.
Configuration objects inherit from :class:`~openspeech.dataclass.OpenspeechDataclass`.
- Configurations:
name (str): name of the feature transform. (default: melspectrogram)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of mel filterbanks to retain. (default: 80)
Filter-Bank Feature Transform¶
-
class openspeech.data.audio.filter_bank.filter_bank.FilterBankFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]¶
Create a fbank from a raw audio signal. This matches the input/output of Kaldi’s compute-fbank-feats.
- Configurations:
name (str): name of the feature transform. (default: fbank)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of mel filterbank bins to retain. (default: 80)
- Parameters
configs (DictConfig) – Hydra configuration set
- Inputs:
signal (np.ndarray): signal from audio file.
- Returns
A fbank identical to what Kaldi would output, with shape (seq_length, num_mels).
- Return type
Tensor
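Conceptually, a Kaldi-style fbank feature is the log of the mel filterbank energies, without the final DCT step that MFCC adds. A minimal sketch of that last step, with placeholder inputs (this is not Kaldi's exact algorithm, which also differs in windowing and dithering details):

```python
import numpy as np

# Placeholder power spectrogram (seq_length, linear bins) and a placeholder
# bank of 80 mel filters; real filters would be triangular, as for mel-spectrograms.
power_spectrogram = np.random.rand(98, 161)
mel_filterbank = np.random.rand(80, 161)

mel_energies = power_spectrogram @ mel_filterbank.T
fbank = np.log(np.maximum(mel_energies, 1e-10))   # floor avoids log(0)
print(fbank.shape)  # (98, 80) -> (seq_length, num_mels)
```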
Filter-Bank Feature Transform Configuration¶
-
class openspeech.data.audio.filter_bank.configuration.FilterBankConfigs(name: str = 'fbank', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80)[source]¶
This is the configuration class to store the configuration of a FilterBankFeatureTransform. It is used to initialize a FilterBankFeatureTransform feature transform.
Configuration objects inherit from :class:`~openspeech.dataclass.configs.OpenspeechDataclass`.
- Configurations:
name (str): name of the feature transform. (default: fbank)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of mel filterbank bins to retain. (default: 80)
MFCC Feature Transform¶
-
class openspeech.data.audio.mfcc.mfcc.MFCCFeatureTransform(configs: omegaconf.dictconfig.DictConfig)[source]¶
Create the Mel-frequency cepstral coefficients from an audio signal.
By default, this calculates the MFCC on the DB-scaled mel-spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.
This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.
- Configurations:
name (str): name of the feature transform. (default: mfcc)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of mfc coefficients to retain. (default: 40)
- Parameters
configs (DictConfig) – configuration set
- Returns
An MFCC feature with shape (seq_length, num_mels).
- Return type
Tensor
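The step that turns a log-mel spectrogram into MFCCs is an orthonormal DCT-II along the mel axis, keeping the first few coefficients. A self-contained NumPy sketch of that transform (illustrative; librosa and torchaudio implement the same math with additional options):

```python
import numpy as np

# Orthonormal DCT-II: project each log-mel frame onto cosine basis vectors
# and keep the first n_coeff cepstral coefficients.
def dct_ii(x, n_coeff):
    n = x.shape[-1]
    k = np.arange(n_coeff)[:, None]        # (n_coeff, 1) coefficient indices
    i = np.arange(n)[None, :]              # (1, n) mel-bin indices
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)               # orthonormal scaling for k = 0
    return x @ basis.T

log_mel = np.log(np.random.rand(98, 80) + 1e-6)   # placeholder log-mel features
mfcc = dct_ii(log_mel, n_coeff=40)
print(mfcc.shape)  # (98, 40)
```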
MFCC Feature Transform Configuration¶
-
class openspeech.data.audio.mfcc.configuration.MFCCConfigs(name: str = 'mfcc', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 40)[source]¶
This is the configuration class to store the configuration of a MFCCFeatureTransform. It is used to initialize a MFCCFeatureTransform feature transform.
Configuration objects inherit from :class:`~openspeech.dataclass.OpenspeechDataclass`.
- Configurations:
name (str): name of the feature transform. (default: mfcc)
sample_rate (int): sampling rate of the audio. (default: 16000)
frame_length (float): frame length for the spectrogram, in milliseconds. (default: 20.0)
frame_shift (float): length of hop between STFT windows, in milliseconds. (default: 10.0)
del_silence (bool): flag indicating whether to delete silence or not. (default: False)
num_mels (int): the number of mfc coefficients to retain. (default: 40)
Augment¶
-
class openspeech.data.audio.augment.NoiseInjector(noise_dataset_dir: str, sample_rate: int = 16000, noise_level: float = 0.7)[source]¶
Provides noise injection for noise augmentation.
- The noise augmentation process is as follows:
1. Randomly sample noise_size audios from the noise dataset.
2. Extract noise from the sampled audio_paths.
3. Add the noise to the sound.
- Parameters
noise_dataset_dir (str) – directory of the noise dataset
sample_rate (int) – sampling rate of the audio (default: 16000)
noise_level (float) – level of the injected noise (default: 0.7)
- Inputs: signal
signal: signal from an audio file
- Returns: signal
signal: signal with noise added
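The mixing step above can be sketched in a few lines of NumPy. This is a hedged illustration only: the real class samples noise clips from noise_dataset_dir, whereas here the noise is a placeholder array.

```python
import numpy as np

# Mix a noise clip into the signal, scaled by a random factor up to noise_level.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)        # 1 second of placeholder audio
noise = rng.standard_normal(16000)         # placeholder noise clip

noise_level = 0.7
scale = rng.uniform(0, noise_level)        # random injection strength
noisy = signal + scale * noise[: len(signal)]
print(noisy.shape)  # (16000,)
```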
-
class openspeech.data.audio.augment.SpecAugment(freq_mask_para: int = 18, time_mask_num: int = 10, freq_mask_num: int = 2)[source]¶
Provides SpecAugment, a simple data augmentation method for speech recognition. This concept was proposed in https://arxiv.org/abs/1904.08779
- Parameters
freq_mask_para (int) – maximum width of each frequency mask (default: 18)
time_mask_num (int) – number of time masks to apply (default: 10)
freq_mask_num (int) – number of frequency masks to apply (default: 2)
- Inputs: feature_vector
feature_vector (torch.FloatTensor): feature vector from an audio file.
- Returns: feature_vector
feature_vector: masked feature vector.
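SpecAugment-style masking zeroes random frequency bands and time spans of the feature matrix. A NumPy sketch under the default parameters (illustrative; the real class operates on torch tensors, and the time-mask width bound here is a made-up placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.standard_normal((98, 80))    # (time, frequency) feature matrix
freq_mask_para, time_mask_num, freq_mask_num = 18, 10, 2

# Frequency masking: zero freq_mask_num bands of width <= freq_mask_para.
for _ in range(freq_mask_num):
    f = rng.integers(0, freq_mask_para + 1)
    f0 = rng.integers(0, feature.shape[1] - f + 1)
    feature[:, f0:f0 + f] = 0.0

# Time masking: zero time_mask_num spans along the time axis.
for _ in range(time_mask_num):
    t = rng.integers(0, 20)                # placeholder width bound
    t0 = rng.integers(0, feature.shape[0] - t + 1)
    feature[t0:t0 + t, :] = 0.0

print(feature.shape)  # (98, 80) -- same shape, with masked regions zeroed
```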
-
class openspeech.data.audio.augment.TimeStretchAugment(min_rate: float = 0.7, max_rate: float = 1.4)[source]¶
Time-stretch an audio series by a fixed rate.
- Inputs:
signal: np.ndarray [shape=(n,)] audio time series
- Returns
y_stretch: np.ndarray [shape=(round(n/rate),)] audio time series stretched by the specified rate
- Return type
np.ndarray
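The output length round(n/rate) can be illustrated with a naive linear-resampling sketch. Note this is only a length illustration: a proper time stretch (e.g. phase-vocoder based, as in librosa) preserves pitch, which plain resampling does not.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)        # n = 16000 samples
rate = 1.25                                # a rate drawn from [min_rate, max_rate]

# Resample to round(n / rate) points via linear interpolation.
n_out = round(len(signal) / rate)
positions = np.linspace(0, len(signal) - 1, n_out)
y_stretch = np.interp(positions, np.arange(len(signal)), signal)
print(y_stretch.shape)  # (12800,) == (round(n/rate),)
```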