Datasets

AISHELL-1

- class openspeech.datasets.aishell.lit_data_module.LightningAIShellDataModule(*args: Any, **kwargs: Any)
  Lightning data module for AIShell-1 (see the usage sketch after this entry).

  - Parameters
    configs (DictConfig) – configuration set.

  - prepare_data()
    Prepare the AIShell-1 manifest file. If the manifest file does not exist, generate it.

    - Returns
      vocab class of AIShell-1.
    - Return type
      vocab (Vocabulary)

  - setup(stage: Optional[str] = None, vocab: openspeech.vocabs.vocab.Vocabulary = None)
    Split the dataset into train and valid sets for training.

    - Parameters
      stage (str) – stage of training: train or valid.
      vocab (Vocabulary) – vocab class of AIShell-1.
    - Returns
      None
  - test_dataloader() → torch.utils.data.dataloader.DataLoader
    Implement one or multiple PyTorch DataLoaders for testing.

    The dataloader you return will not be called every epoch unless you set
    Trainer.reload_dataloaders_every_epoch to True.

    For data processing use the following pattern:
      - download in prepare_data()
      - process and split in setup()

    However, the above are only necessary for distributed processing.

    Warning: do not assign state in prepare_data.

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    - Returns
      Single or multiple PyTorch DataLoaders.

    Example:

      def test_dataloader(self):
          transform = transforms.Compose([transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (1.0,))])
          dataset = MNIST(root='/path/to/mnist/', train=False,
                          transform=transform, download=True)
          loader = torch.utils.data.DataLoader(
              dataset=dataset,
              batch_size=self.batch_size,
              shuffle=False
          )
          return loader

      # can also return multiple dataloaders
      def test_dataloader(self):
          return [loader_a, loader_b, ..., loader_n]

    Note: If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

    Note: In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
  - train_dataloader() → torch.utils.data.dataloader.DataLoader
    Implement one or more PyTorch DataLoaders for training.

    - Returns
      Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, please see this page.

    The dataloader you return will not be called every epoch unless you set
    Trainer.reload_dataloaders_every_epoch to True.

    For data processing use the following pattern:
      - download in prepare_data()
      - process and split in setup()

    However, the above are only necessary for distributed processing.

    Warning: do not assign state in prepare_data.

    Call order: fit() → … → train_dataloader()

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    Example:

      # single dataloader
      def train_dataloader(self):
          transform = transforms.Compose([transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (1.0,))])
          dataset = MNIST(root='/path/to/mnist/', train=True,
                          transform=transform, download=True)
          loader = torch.utils.data.DataLoader(
              dataset=dataset,
              batch_size=self.batch_size,
              shuffle=True
          )
          return loader

      # multiple dataloaders, return as list
      def train_dataloader(self):
          mnist = MNIST(...)
          cifar = CIFAR(...)
          mnist_loader = torch.utils.data.DataLoader(
              dataset=mnist,
              batch_size=self.batch_size,
              shuffle=True
          )
          cifar_loader = torch.utils.data.DataLoader(
              dataset=cifar,
              batch_size=self.batch_size,
              shuffle=True
          )
          # each batch will be a list of tensors: [batch_mnist, batch_cifar]
          return [mnist_loader, cifar_loader]

      # multiple dataloaders, return as dict
      def train_dataloader(self):
          mnist = MNIST(...)
          cifar = CIFAR(...)
          mnist_loader = torch.utils.data.DataLoader(
              dataset=mnist,
              batch_size=self.batch_size,
              shuffle=True
          )
          cifar_loader = torch.utils.data.DataLoader(
              dataset=cifar,
              batch_size=self.batch_size,
              shuffle=True
          )
          # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
          return {'mnist': mnist_loader, 'cifar': cifar_loader}
  - val_dataloader() → torch.utils.data.dataloader.DataLoader
    Implement one or multiple PyTorch DataLoaders for validation.

    The dataloader you return will not be called every epoch unless you set
    Trainer.reload_dataloaders_every_epoch to True.

    It’s recommended that all data downloads and preparation happen in prepare_data().

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    - Returns
      Single or multiple PyTorch DataLoaders.

    Examples:

      def val_dataloader(self):
          transform = transforms.Compose([transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (1.0,))])
          dataset = MNIST(root='/path/to/mnist/', train=False,
                          transform=transform, download=True)
          loader = torch.utils.data.DataLoader(
              dataset=dataset,
              batch_size=self.batch_size,
              shuffle=False
          )
          return loader

      # can also return multiple dataloaders
      def val_dataloader(self):
          return [loader_a, loader_b, ..., loader_n]

    Note: If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

    Note: In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
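  The entries above describe the hook signatures only. The following is a minimal usage sketch, not taken from the OpenSpeech documentation: it assumes a Hydra/omegaconf configuration set (configs) has already been composed with the dataset paths and preprocessing options this module expects; those keys are assumptions and are not listed in this reference.

  Example (sketch):

    # Hedged sketch: only the class path and the documented hook signatures
    # (prepare_data -> setup -> *_dataloader) come from this reference;
    # the contents of `configs` are assumed.
    from typing import Tuple

    from omegaconf import DictConfig
    from torch.utils.data import DataLoader

    from openspeech.datasets.aishell.lit_data_module import LightningAIShellDataModule


    def build_aishell_loaders(configs: DictConfig) -> Tuple[DataLoader, DataLoader]:
        data_module = LightningAIShellDataModule(configs)

        # prepare_data() generates the manifest file if it does not already
        # exist and returns the vocabulary built for the corpus.
        vocab = data_module.prepare_data()

        # setup() splits the data into train and valid datasets.
        data_module.setup(vocab=vocab)

        # The standard Lightning hooks then expose the loaders.
        return data_module.train_dataloader(), data_module.val_dataloader()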
KsponSpeech

- class openspeech.datasets.ksponspeech.lit_data_module.LightningKsponSpeechDataModule(*args: Any, **kwargs: Any)
  Lightning data module for KsponSpeech (see the usage sketch after this entry).

  - Parameters
    configs (DictConfig) – configuration set.

  - prepare_data()
    Prepare the KsponSpeech manifest file. If the manifest file does not exist, generate it.

    - Returns
      vocab class of KsponSpeech.
    - Return type
      vocab (Vocabulary)

  - setup(stage: Optional[str] = None, vocab: openspeech.vocabs.vocab.Vocabulary = None)
    Split the dataset into train and valid sets for training.

    - Parameters
      stage (str) – stage of training: train or valid.
      vocab (Vocabulary) – vocab class of KsponSpeech.
    - Returns
      None

  - test_dataloader() → openspeech.data.data_loader.AudioDataLoader
    Return the data loader for testing.

  - train_dataloader() → openspeech.data.data_loader.AudioDataLoader
    Return the data loader for training.

  - val_dataloader() → openspeech.data.data_loader.AudioDataLoader
    Return the data loader for validation.
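  The dataloader hooks above return openspeech.data.data_loader.AudioDataLoader instances. Below is a hedged smoke-test sketch, not part of the library documentation: it drives the documented hooks by hand and pulls a single batch. The contents of configs and the structure of a batch are assumptions and are deliberately not unpacked here.

  Example (sketch):

    # Hedged sketch: only the class path and the documented hooks come from
    # this reference; the keys inside `configs` are assumed.
    from omegaconf import DictConfig

    from openspeech.datasets.ksponspeech.lit_data_module import LightningKsponSpeechDataModule


    def smoke_test_ksponspeech(configs: DictConfig) -> None:
        data_module = LightningKsponSpeechDataModule(configs)

        # Documented flow: build the manifest and vocabulary, then split the data.
        vocab = data_module.prepare_data()
        data_module.setup(vocab=vocab)

        # train_dataloader() returns an AudioDataLoader; pull one batch to check
        # that the manifest, audio paths, and vocabulary line up.
        train_loader = data_module.train_dataloader()
        batch = next(iter(train_loader))
        print(type(batch))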
LibriSpeech

- class openspeech.datasets.librispeech.lit_data_module.LightningLibriSpeechDataModule(*args: Any, **kwargs: Any)
  PyTorch Lightning data module for the LibriSpeech dataset (see the usage sketch after this entry).

  - Parameters
    configs (DictConfig) – configuration set.

  - prepare_data() → openspeech.vocabs.vocab.Vocabulary
    Prepare the LibriSpeech data.

    - Returns
      vocab class of LibriSpeech.
    - Return type
      vocab (Vocabulary)

  - setup(stage: Optional[str] = None, vocab: openspeech.vocabs.vocab.Vocabulary = None) → None
    Split the dataset into train, valid, and test sets.
  - test_dataloader() → torch.utils.data.dataloader.DataLoader
    Implement one or multiple PyTorch DataLoaders for testing.

    The dataloader you return will not be called every epoch unless you set
    Trainer.reload_dataloaders_every_epoch to True.

    For data processing use the following pattern:
      - download in prepare_data()
      - process and split in setup()

    However, the above are only necessary for distributed processing.

    Warning: do not assign state in prepare_data.

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    - Returns
      Single or multiple PyTorch DataLoaders.

    Example:

      def test_dataloader(self):
          transform = transforms.Compose([transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (1.0,))])
          dataset = MNIST(root='/path/to/mnist/', train=False,
                          transform=transform, download=True)
          loader = torch.utils.data.DataLoader(
              dataset=dataset,
              batch_size=self.batch_size,
              shuffle=False
          )
          return loader

      # can also return multiple dataloaders
      def test_dataloader(self):
          return [loader_a, loader_b, ..., loader_n]

    Note: If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

    Note: In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
  - train_dataloader() → torch.utils.data.dataloader.DataLoader
    Implement one or more PyTorch DataLoaders for training.

    - Returns
      Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, please see this page.

    The dataloader you return will not be called every epoch unless you set
    Trainer.reload_dataloaders_every_epoch to True.

    For data processing use the following pattern:
      - download in prepare_data()
      - process and split in setup()

    However, the above are only necessary for distributed processing.

    Warning: do not assign state in prepare_data.

    Call order: fit() → … → train_dataloader()

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    Example:

      # single dataloader
      def train_dataloader(self):
          transform = transforms.Compose([transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (1.0,))])
          dataset = MNIST(root='/path/to/mnist/', train=True,
                          transform=transform, download=True)
          loader = torch.utils.data.DataLoader(
              dataset=dataset,
              batch_size=self.batch_size,
              shuffle=True
          )
          return loader

      # multiple dataloaders, return as list
      def train_dataloader(self):
          mnist = MNIST(...)
          cifar = CIFAR(...)
          mnist_loader = torch.utils.data.DataLoader(
              dataset=mnist,
              batch_size=self.batch_size,
              shuffle=True
          )
          cifar_loader = torch.utils.data.DataLoader(
              dataset=cifar,
              batch_size=self.batch_size,
              shuffle=True
          )
          # each batch will be a list of tensors: [batch_mnist, batch_cifar]
          return [mnist_loader, cifar_loader]

      # multiple dataloaders, return as dict
      def train_dataloader(self):
          mnist = MNIST(...)
          cifar = CIFAR(...)
          mnist_loader = torch.utils.data.DataLoader(
              dataset=mnist,
              batch_size=self.batch_size,
              shuffle=True
          )
          cifar_loader = torch.utils.data.DataLoader(
              dataset=cifar,
              batch_size=self.batch_size,
              shuffle=True
          )
          # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
          return {'mnist': mnist_loader, 'cifar': cifar_loader}
  - val_dataloader() → torch.utils.data.dataloader.DataLoader
    Implement one or multiple PyTorch DataLoaders for validation.

    The dataloader you return will not be called every epoch unless you set
    Trainer.reload_dataloaders_every_epoch to True.

    It’s recommended that all data downloads and preparation happen in prepare_data().

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    - Returns
      Single or multiple PyTorch DataLoaders.

    Examples:

      def val_dataloader(self):
          transform = transforms.Compose([transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (1.0,))])
          dataset = MNIST(root='/path/to/mnist/', train=False,
                          transform=transform, download=True)
          loader = torch.utils.data.DataLoader(
              dataset=dataset,
              batch_size=self.batch_size,
              shuffle=False
          )
          return loader

      # can also return multiple dataloaders
      def val_dataloader(self):
          return [loader_a, loader_b, ..., loader_n]

    Note: If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

    Note: In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
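  As with the other data modules, the following is a hedged sketch of the documented call sequence, covering the train/valid/test split that setup() performs for LibriSpeech. Only the class path and the hook signatures come from this reference; the contents of configs are assumptions.

  Example (sketch):

    # Hedged sketch: `configs` is assumed to carry the LibriSpeech paths and
    # options this module expects; those keys are not documented here.
    from typing import Tuple

    from omegaconf import DictConfig
    from torch.utils.data import DataLoader

    from openspeech.datasets.librispeech.lit_data_module import LightningLibriSpeechDataModule


    def build_librispeech_loaders(configs: DictConfig) -> Tuple[DataLoader, DataLoader, DataLoader]:
        data_module = LightningLibriSpeechDataModule(configs)

        # prepare_data() prepares the LibriSpeech data and returns the vocabulary.
        vocab = data_module.prepare_data()

        # setup() splits the dataset into train, valid, and test.
        data_module.setup(vocab=vocab)

        return (
            data_module.train_dataloader(),
            data_module.val_dataloader(),
            data_module.test_dataloader(),
        )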