Vocabulary

class openspeech.vocabs.vocab.Vocabulary(*args, **kwargs)

Note

Do not use this class directly; use one of its subclasses instead.

AISHELL-1 Character

class openspeech.vocabs.aishell.character.AIShellCharacterVocabConfigs(sos_token: str = '<sos>', eos_token: str = '<eos>', pad_token: str = '<pad>', blank_token: str = '<blank>', encoding: str = 'utf-8', unit: str = 'aishell_character', vocab_path: str = '../../../data_aishell/aishell_labels.csv')
class openspeech.vocabs.aishell.character.AIShellCharacterVocabulary(configs: omegaconf.dictconfig.DictConfig)

Vocabulary Class in Character Units.

Parameters

configs (DictConfig) – configuration set.

label_to_string(labels)

Converts numeric labels to a string.

Parameters

labels (numpy.ndarray) – numeric labels

Returns: sentence
  • sentence (str or list): symbols corresponding to the labels
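
The conversion described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name label_to_string_sketch, the id2unit mapping, the eos_id and blank_id arguments, and the rules of stopping at the end-of-sentence token and skipping blank tokens are all assumptions. Plain Python lists stand in for numpy.ndarray so the sketch has no dependencies.

```python
from typing import Dict, List, Sequence, Union


def label_to_string_sketch(
    labels: Sequence,
    id2unit: Dict[int, str],
    eos_id: int,
    blank_id: int,
) -> Union[str, List[str]]:
    """Map numeric labels to text; a nested sequence (batch) yields a list."""
    # A nested sequence is treated as a batch of label sequences.
    if len(labels) > 0 and hasattr(labels[0], "__len__"):
        return [label_to_string_sketch(row, id2unit, eos_id, blank_id) for row in labels]

    units = []
    for label in labels:
        if label == eos_id:    # assumption: decoding stops at end-of-sentence
            break
        if label == blank_id:  # assumption: blank tokens are skipped
            continue
        units.append(id2unit[label])
    return "".join(units)


# Hypothetical toy vocabulary, for illustration only.
toy_id2unit = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<blank>", 4: "h", 5: "i"}
single = label_to_string_sketch([4, 3, 5, 2, 4], toy_id2unit, eos_id=2, blank_id=3)
batch = label_to_string_sketch([[4, 5], [5, 2, 4]], toy_id2unit, eos_id=2, blank_id=3)
print(single)  # hi
print(batch)   # ['hi', 'i']
```

Returning a str for one sequence and a list for a batch mirrors the documented "str or list" return type; whether the library distinguishes the two cases this way is an assumption.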

load_vocab(vocab_path, encoding='utf-8')

Builds the unit2id and id2unit mappings from the vocabulary file.

Parameters
  • vocab_path (str) – csv file with character labels

  • encoding (str) – encoding method

Returns: unit2id, id2unit
  • unit2id (dict): unit2id[unit] = id

  • id2unit (dict): id2unit[id] = unit
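
A minimal sketch of building the two mappings from a CSV label file is shown below. The function name load_vocab_sketch and the file layout (a header row, then one row per unit with the id in the first column and the unit in the second) are assumptions; the source does not describe the CSV format.

```python
import csv
import os
import tempfile
from typing import Dict, Tuple


def load_vocab_sketch(vocab_path: str, encoding: str = "utf-8") -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build unit2id and id2unit from a CSV label file."""
    unit2id: Dict[str, int] = {}
    id2unit: Dict[int, str] = {}
    with open(vocab_path, "r", encoding=encoding, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # assumption: the first row is a header
        for row in reader:
            if not row:
                continue
            idx, unit = int(row[0]), row[1]  # assumption: columns are id, unit, ...
            unit2id[unit] = idx
            id2unit[idx] = unit
    return unit2id, id2unit


# Write a small hypothetical label file, load it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8", newline="") as tmp:
    tmp.write("id,char,freq\n0,<pad>,0\n1,<sos>,0\n2,<eos>,0\n3,a,10\n")
    demo_path = tmp.name

unit2id, id2unit = load_vocab_sketch(demo_path)
os.remove(demo_path)
print(unit2id["a"], id2unit[2])  # 3 <eos>
```

Keeping both directions of the mapping allows encoding (unit to id) and decoding (id to unit) without inverting a dictionary on each call.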

KsponSpeech Character

class openspeech.vocabs.ksponspeech.character.KsponSpeechCharacterVocabConfigs(sos_token: str = '<sos>', eos_token: str = '<eos>', pad_token: str = '<pad>', blank_token: str = '<blank>', encoding: str = 'utf-8', unit: str = 'kspon_character', vocab_path: str = '../../../aihub_labels.csv')
class openspeech.vocabs.ksponspeech.character.KsponSpeechCharacterVocabulary(configs: omegaconf.dictconfig.DictConfig)

Vocabulary Class in Character Units.

Parameters

configs (DictConfig) – configuration set.

label_to_string(labels)

Converts numeric labels to a string (number => Hangeul).

Parameters

labels (numpy.ndarray) – numeric labels

Returns: sentence
  • sentence (str or list): symbols corresponding to the labels

load_vocab(vocab_path, encoding='utf-8')

Builds the unit2id and id2unit mappings from the vocabulary file.

Parameters
  • vocab_path (str) – csv file with character labels

  • encoding (str) – encoding method

Returns: unit2id, id2unit
  • unit2id (dict): unit2id[unit] = id

  • id2unit (dict): id2unit[id] = unit

KsponSpeech Subword

class openspeech.vocabs.ksponspeech.subword.KsponSpeechSubwordVocabConfigs(sos_token: str = '<s>', eos_token: str = '</s>', pad_token: str = '<pad>', blank_token: str = '<blank>', encoding: str = 'utf-8', unit: str = 'kspon_subword', sp_model_path: str = 'sp.model', vocab_size: int = 3200)
class openspeech.vocabs.ksponspeech.subword.KsponSpeechSubwordVocabulary(configs: omegaconf.dictconfig.DictConfig)

Vocabulary Class in Subword Units.

Parameters

configs (DictConfig) – configuration set.

label_to_string(labels)

Converts numeric labels to a string (number => Hangeul).

Parameters

labels (numpy.ndarray) – numeric labels

Returns: sentence
  • sentence (str or list): symbols corresponding to the labels

KsponSpeech Grapheme

class openspeech.vocabs.ksponspeech.grapheme.KsponSpeechGraphemeVocabConfigs(sos_token: str = '<sos>', eos_token: str = '<eos>', pad_token: str = '<pad>', blank_token: str = '<blank>', encoding: str = 'utf-8', unit: str = 'kspon_grapheme', vocab_path: str = '../../../aihub_labels.csv')
class openspeech.vocabs.ksponspeech.grapheme.KsponSpeechGraphemeVocabulary(configs: omegaconf.dictconfig.DictConfig)

Vocabulary Class in Grapheme Units.

Parameters

configs (DictConfig) – configuration set.

label_to_string(labels)

Converts numeric labels to a string (number => Hangeul).

Parameters

labels (numpy.ndarray) – numeric labels

Returns: sentence
  • sentence (str or list): symbols corresponding to the labels
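
To illustrate how grapheme units relate to Hangeul text, the sketch below decomposes syllables into conjoining jamo and recomposes them with Python's standard unicodedata module. It assumes that "grapheme" here means Hangul jamo obtained by Unicode decomposition; the source does not state how the library defines its grapheme units, and the helper names to_graphemes and from_graphemes are hypothetical.

```python
import unicodedata
from typing import List


def to_graphemes(text: str) -> List[str]:
    """Split Hangul syllables into conjoining jamo (assumed grapheme units)."""
    return list(unicodedata.normalize("NFD", text))


def from_graphemes(units: List[str]) -> str:
    """Recompose a jamo sequence into precomposed Hangul syllables."""
    return unicodedata.normalize("NFC", "".join(units))


jamo = to_graphemes("한")
print(len(jamo))                 # 3 (initial consonant, vowel, final consonant)
print(from_graphemes(jamo))      # 한
```

Under this assumption, a grapheme vocabulary stays small (individual jamo rather than thousands of distinct syllables), at the cost of longer label sequences.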

load_vocab(vocab_path, encoding='utf-8')

Builds the unit2id and id2unit mappings from the vocabulary file.

Parameters
  • vocab_path (str) – csv file with character labels

  • encoding (str) – encoding method

Returns: unit2id, id2unit
  • unit2id (dict): unit2id[unit] = id

  • id2unit (dict): id2unit[id] = unit

LibriSpeech Character

class openspeech.vocabs.librispeech.character.LibriSpeechCharacterVocabConfigs(sos_token: str = '<sos>', eos_token: str = '<eos>', pad_token: str = '<pad>', blank_token: str = '<blank>', encoding: str = 'utf-8', unit: str = 'libri_character', vocab_path: str = '../../../LibriSpeech/libri_labels.csv')
class openspeech.vocabs.librispeech.character.LibriSpeechCharacterVocabulary(configs: omegaconf.dictconfig.DictConfig)

Vocabulary Class in Character Units.

Parameters

configs (DictConfig) – configuration set.

label_to_string(labels)

Converts numeric labels to a string (number => character).

Parameters

labels (numpy.ndarray) – numeric labels

Returns: sentence
  • sentence (str or list): symbols corresponding to the labels

load_vocab(vocab_path, encoding='utf-8')

Builds the unit2id and id2unit mappings from the vocabulary file.

Parameters
  • vocab_path (str) – csv file with character labels

  • encoding (str) – encoding method

Returns: unit2id, id2unit
  • unit2id (dict): unit2id[unit] = id

  • id2unit (dict): id2unit[id] = unit

LibriSpeech Subword

class openspeech.vocabs.librispeech.subword.LibriSpeechSubwordVocabConfigs(sos_token: str = '<s>', eos_token: str = '</s>', pad_token: str = '<pad>', blank_token: str = '<blank>', encoding: str = 'utf-8', unit: str = 'libri_subword', vocab_size: int = 5000, vocab_path: str = '../../../LibriSpeech/')
class openspeech.vocabs.librispeech.subword.LibriSpeechSubwordVocabulary(configs: omegaconf.dictconfig.DictConfig)

Vocabulary class in subword units; converts labels to strings for the LibriSpeech dataset.

Parameters

configs (DictConfig) – configuration set.