---
title: dataset_torch
keywords: fastai
sidebar: home_sidebar
summary: "Module to load the slates dataset into a PyTorch Dataset and Dataloaders with default train/valid/test splits."
description: "Module to load the slates dataset into a PyTorch Dataset and Dataloaders with default train/valid/test splits."
nb_path: "dataset_torch.ipynb"
---
{% raw %}
{% endraw %} {% raw %}

class SequentialDataset[source]

SequentialDataset(*args, **kwds) :: Dataset

A PyTorch Dataset for the FINN Recsys Slates Dataset.

Attributes:

- `data`: [Dict] A dictionary of dataset tensors. The first dimension of each tensor must be the batch dimension. Requires the keys "click" and "slate"; additional entries can be added.
- `sample_candidate_items`: [int] Number of negative item examples sampled from the item universe for each interaction. If positive, the dataset provides an additional dictionary entry "allitem". Often also called uniform candidate sampling. See Eide et al. (2021) for more information.
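Below is a minimal sketch of constructing a `SequentialDataset` directly. The constructor arguments (`data`, `sample_candidate_items`) are assumed from the attribute list above, and the toy tensors are placeholders; in practice the data dictionary is built for you by `load_dataloaders` below.

```python
import torch

# Toy placeholder tensors: 8 users, 20 time steps, slates of 25 items
# (shapes only mirror the slate size logged further down this page).
toy_data = {
    "slate": torch.randint(0, 1000, (8, 20, 25)),
    "click": torch.randint(0, 1000, (8, 20)),
}

# Assumed constructor arguments, taken from the attribute list above.
dataset = SequentialDataset(data=toy_data, sample_candidate_items=0)
first = dataset[0]  # expected to yield one user's tensors as a dict (assumption)
```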

{% endraw %} {% raw %}
{% endraw %} {% raw %}

load_dataloaders[source]

load_dataloaders(data_dir='dat', batch_size=1024, num_workers=0, sample_candidate_items=False, valid_pct=0.05, test_pct=0.05, t_testsplit=5, limit_num_users=None, seed=0)

Loads PyTorch dataloaders to be used in training. With default settings, the train/valid/test split is equivalent to the one used in Eide et al. (2021).

Attributes:

- `data_dir`: [str] Directory where the data is downloaded and stored if not already present.
- `batch_size`: [int] Batch size produced by the dataloaders.
- `num_workers`: [int] Number of worker threads used to prepare batches of data.
- `sample_candidate_items`: [int] Number of negative item examples sampled from the item universe for each interaction. If positive, the dataset provides an additional dictionary entry "allitem". Often also called uniform candidate sampling. See Eide et al. (2021) for more information.
- `valid_pct`: [float] Percentage of users allocated to the validation dataset.
- `test_pct`: [float] Percentage of users allocated to the test dataset.
- `t_testsplit`: [int] For users allocated to the validation and test datasets, the number of initial interactions that remain part of the training dataset.
- `limit_num_users`: [int] For debugging purposes, only return this many users.
- `seed`: [int] Seed used to sample users/items.
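For reference, here is an illustrative call with non-default arguments; the parameter names follow the signature above, while the values are only examples.

```python
# Illustrative call; argument names follow the signature above.
ind2val, itemattr, dataloaders = load_dataloaders(
    data_dir="dat",
    batch_size=512,
    num_workers=2,
    sample_candidate_items=100,  # >0 adds the "allitem" entry to each batch
    valid_pct=0.05,
    test_pct=0.05,
    t_testsplit=5,
    limit_num_users=10000,       # handy for quick debugging runs
    seed=0,
)
```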

{% endraw %} {% raw %}
{% endraw %} {% raw %}
ind2val, itemattr, dataloaders = load_dataloaders()
2021-08-13 10:15:07,665 Download data if not in data folder..
2021-08-13 10:15:07,666 Downloading data.npz
2021-08-13 10:15:07,667 Downloading ind2val.json
2021-08-13 10:15:07,667 Downloading itemattr.npz
2021-08-13 10:15:07,668 Done downloading all files.
2021-08-13 10:15:07,668 Load data..
2021-08-13 10:15:31,565 Loading dataset with slate size=torch.Size([2277645, 20, 25]) and uniform candidate sampling=False
2021-08-13 10:15:31,834 Loading dataset with slate size=torch.Size([2277645, 20, 25]) and uniform candidate sampling=False
2021-08-13 10:15:31,839 Loading dataset with slate size=torch.Size([113882, 20, 25]) and uniform candidate sampling=False
2021-08-13 10:15:31,844 Loading dataset with slate size=torch.Size([113882, 20, 25]) and uniform candidate sampling=False
2021-08-13 10:15:31,845 In train: num_users: 2277645, num_batches: 2225
2021-08-13 10:15:31,846 In valid: num_users: 113882, num_batches: 112
2021-08-13 10:15:31,846 In test: num_users: 113882, num_batches: 112
{% endraw %}
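{% raw %}
Once the dataloaders are loaded, a single batch can be inspected as sketched below. The "train"/"valid"/"test" keys are taken from the log output above, and the "click"/"slate" entries from the `SequentialDataset` attribute list; exact shapes depend on the dataset version.

```python
# Sketch: pull one batch from the training dataloader and inspect it.
batch = next(iter(dataloaders["train"]))
print(batch["slate"].shape)  # e.g. torch.Size([1024, 20, 25])
print(batch["click"].shape)  # clicks aligned with each slate
```
{% endraw %}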