---
title: "Metadatasets: a dataset of datasets"
keywords: fastai
sidebar: home_sidebar
summary: "This functionality will allow you to create a dataset from data stored in multiple, smaller datasets."
description: "This functionality will allow you to create a dataset from data stored in multiple, smaller datasets."
nb_path: "nbs/002c_data.metadatasets.ipynb"
---
{% raw %}
{% endraw %}
  • I'd like to thank both Thomas Capelle (https://github.com/tcapelle) and Xander Dunn (https://github.com/xanderdunn) for their contributions, which made this code possible.
  • This functionality allows you to use multiple numpy arrays instead of a single one, which may be very useful in many practical settings. I've tested it with 10k+ datasets and it works well.
{% raw %}
{% endraw %} {% raw %}

class TSMetaDataset[source]

TSMetaDataset(dataset_list, **kwargs)

A dataset capable of indexing multiple datasets at the same time

{% endraw %} {% raw %}

class TSMetaDatasets[source]

TSMetaDatasets(metadataset, splits) :: FilteredBase

Base class for lists with subsets

{% endraw %} {% raw %}
{% endraw %}

Let's create 3 datasets. In this case they will have different sizes.

{% raw %}
vocab = L(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
dsets = []
for i in range(3):
    size = np.random.randint(50, 150)         # each dataset gets a random number of samples
    X = torch.rand(size, 5, 50)               # 5 variables, sequence length 50
    y = vocab[torch.randint(0, 10, (size,))]  # random labels drawn from the vocab
    tfms = [None, TSClassification(add_na=True)]
    dset = TSDatasets(X, y, tfms=tfms)
    dsets.append(dset)
dsets
[(#50) [(TSTensor(vars:5, len:50), TensorCategory(7)),(TSTensor(vars:5, len:50), TensorCategory(5)),(TSTensor(vars:5, len:50), TensorCategory(5)),(TSTensor(vars:5, len:50), TensorCategory(4)),(TSTensor(vars:5, len:50), TensorCategory(7)),(TSTensor(vars:5, len:50), TensorCategory(3)),(TSTensor(vars:5, len:50), TensorCategory(2)),(TSTensor(vars:5, len:50), TensorCategory(4)),(TSTensor(vars:5, len:50), TensorCategory(1)),(TSTensor(vars:5, len:50), TensorCategory(1))...],
 (#111) [(TSTensor(vars:5, len:50), TensorCategory(7)),(TSTensor(vars:5, len:50), TensorCategory(5)),(TSTensor(vars:5, len:50), TensorCategory(8)),(TSTensor(vars:5, len:50), TensorCategory(7)),(TSTensor(vars:5, len:50), TensorCategory(2)),(TSTensor(vars:5, len:50), TensorCategory(9)),(TSTensor(vars:5, len:50), TensorCategory(7)),(TSTensor(vars:5, len:50), TensorCategory(1)),(TSTensor(vars:5, len:50), TensorCategory(3)),(TSTensor(vars:5, len:50), TensorCategory(10))...],
 (#110) [(TSTensor(vars:5, len:50), TensorCategory(6)),(TSTensor(vars:5, len:50), TensorCategory(1)),(TSTensor(vars:5, len:50), TensorCategory(10)),(TSTensor(vars:5, len:50), TensorCategory(1)),(TSTensor(vars:5, len:50), TensorCategory(6)),(TSTensor(vars:5, len:50), TensorCategory(6)),(TSTensor(vars:5, len:50), TensorCategory(7)),(TSTensor(vars:5, len:50), TensorCategory(9)),(TSTensor(vars:5, len:50), TensorCategory(2)),(TSTensor(vars:5, len:50), TensorCategory(10))...]]
{% endraw %} {% raw %}
metadataset = TSMetaDataset(dsets)
metadataset, metadataset.vars, metadataset.len
(<__main__.TSMetaDataset at 0x7fd719f0c950>, 5, 50)
{% endraw %}
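
A `TSMetaDataset` behaves like a single dataset whose total length is the sum of its sub-datasets' lengths (50 + 111 + 110 = 271 in this example); note that the `.len` attribute above is the sequence length (50), not the number of samples. As a quick sanity check (a minimal sketch, assuming `TSMetaDataset` exposes `__len__`, which the splitter below also relies on):

{% raw %}
# the metadataset's length should equal the combined length of its sub-datasets
assert len(metadataset) == sum(len(d) for d in dsets)  # 50 + 111 + 110 = 271
{% endraw %}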

We'll apply splits now to create train and valid metadatasets:

{% raw %}
splits = TimeSplitter()(metadataset)
splits
((#217) [0,1,2,3,4,5,6,7,8,9...],
 (#54) [217,218,219,220,221,222,223,224,225,226...])
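{% endraw %}

`TimeSplitter` splits sequentially: the first 217 samples (80%) are assigned to training and the last 54 (20%) to validation. As a quick check (a minimal sketch, assuming `TimeSplitter`'s default validation fraction of 0.2):

{% raw %}
# the two splits should partition the metadataset, holding out ~20% for validation
assert len(splits[0]) + len(splits[1]) == len(metadataset)
assert len(splits[1]) == round(0.2 * len(metadataset))  # assumes default valid_size=0.2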
{% endraw %} {% raw %}
metadatasets = TSMetaDatasets(metadataset, splits=splits)
metadatasets.train, metadatasets.valid
(<__main__.TSMetaDataset at 0x7fd71a120a90>,
 <__main__.TSMetaDataset at 0x7fd71a120e90>)
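{% endraw %}

Each subset is itself a `TSMetaDataset` restricted to the corresponding split indices, so the subset lengths should match the split sizes, (217, 54) in this run (a minimal check, assuming the subsets expose `__len__` like the parent metadataset):

{% raw %}
# the train/valid subsets should mirror the split sizes
len(metadatasets.train), len(metadatasets.valid)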
{% endraw %} {% raw %}
dls = TSDataLoaders.from_dsets(metadatasets.train, metadatasets.valid)
xb, yb = first(dls.train)
xb, yb
(TSTensor(samples:64, vars:5, len:50),
 TensorCategory([ 7,  4,  9,  2,  3,  2, 10,  6,  1, 10,  7,  3,  9,  9,  7,  3,  2,  2,
          5,  3,  5,  5,  3,  7,  7, 10,  4,  3,  3,  1, 10,  3,  9,  6,  4,  4,
          7,  2,  4,  8,  2,  9,  4,  5,  3,  7, 10,  9,  9, 10,  7,  9,  3, 10,
          7,  5,  6,  6, 10,  5,  8,  9,  8,  5]))
{% endraw %}
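
The batch above used fastai's default batch size of 64 (note `samples:64` in the output). `TSDataLoaders.from_dsets` accepts the usual fastai dataloader arguments, so you can set a different batch size (a minimal sketch using the standard fastai `bs` argument):

{% raw %}
# build the dataloaders with a smaller batch size
dls32 = TSDataLoaders.from_dsets(metadatasets.train, metadatasets.valid, bs=32)
xb32, yb32 = first(dls32.train)  # xb32 is now a TSTensor with 32 samples
{% endraw %}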

There's also an easy way to map any particular sample in a batch back to its original dataset and index:

{% raw %}
dls = TSDataLoaders.from_dsets(metadatasets.train, metadatasets.valid)
xb, yb = first(dls.train)
mappings = dls.train.dataset.mapping_idxs
for i, (xbi, ybi) in enumerate(zip(xb, yb)):
    ds, idx = mappings[i]                    # original dataset and sample index
    test_close(dsets[ds][idx][0].data, xbi)  # inputs match the original sample
    test_close(dsets[ds][idx][1].data, ybi)  # labels match the original sample
{% endraw %}

For example, the 3rd sample in this batch comes from:

{% raw %}
dls.train.dataset.mapping_idxs[2]
array([ 0, 38], dtype=int32)
{% endraw %}
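
This means the 3rd sample in the batch comes from dataset 0, sample 38. Mirroring the verification loop above, you can use this mapping to retrieve the original sample directly:

{% raw %}
# look up where the 3rd batch sample came from and fetch the original
ds, idx = dls.train.dataset.mapping_idxs[2]
x_orig, y_orig = dsets[ds][idx]
test_close(x_orig.data, xb[2])  # matches the 3rd sample in the batch
{% endraw %}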