---
title: Quick start with the FINN.no recsys slate dataset
keywords: fastai
sidebar: home_sidebar
nb_path: "examples/quickstart-finn-recsys-slate-data.ipynb"
---
{% raw %}
{% endraw %}

Install the recsys_slates_dataset pip package

{% raw %}
!pip install recsys_slates_dataset -q
{% endraw %}

Download and load dataloaders that are ready to use

It is possible to load the dataset directly as pytorch dataloaders that use the same dataset splits as in the original paper. Use the load_dataloaders function in the dataset_torch module. It has the following options:

| Argument | Description |
|---|---|
| batch_size | Number of unique users sampled in each batch |
| split_trainvalid | Ratio of the full dataset dedicated to train (valid/test is split evenly among the rest) |
| t_testsplit | For users in valid and test, how many interactions should belong to the training set |
| sample_uniform_action | If this is True, the exposures in the dataset are sampled as in the all-item likelihood (see paper) |

The outputs of the function are ind2val, itemattr and a dictionary with pytorch dataloaders for training, validation and test.

{% raw %}
import torch
from recsys_slates_dataset import dataset_torch
ind2val, itemattr, dataloaders = dataset_torch.load_dataloaders(data_dir="dat")

print("Dictionary containing the dataloaders:")
print(dataloaders)
2021-07-03 21:13:13,672 Download data if not in data folder..
2021-07-03 21:13:13,676 Downloading data.npz
2021-07-03 21:13:13,691 Downloading ind2val.json
2021-07-03 21:13:13,694 Downloading itemattr.npz
2021-07-03 21:13:13,698 Done downloading all files.
2021-07-03 21:13:13,707 Load data..
2021-07-03 21:13:53,963 Loading dataset with slate size=torch.Size([2277645, 20, 25]) and uniform candidate sampling=False
2021-07-03 21:13:54,179 In train: num_users: 2277645, num_batches: 2225
2021-07-03 21:13:54,187 In valid: num_users: 113882, num_batches: 112
2021-07-03 21:13:54,192 In test: num_users: 113882, num_batches: 112
Dictionary containing the dataloaders:
{'train': <torch.utils.data.dataloader.DataLoader object at 0x7fcd88da59a0>, 'valid': <torch.utils.data.dataloader.DataLoader object at 0x7fcd88dc6e80>, 'test': <torch.utils.data.dataloader.DataLoader object at 0x7fcd88dc6550>}
{% endraw %}
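For reference, the same call with the arguments from the table spelled out might look as follows. The values are purely illustrative (batch_size=1024 matches the batch size shown further down); check the function signature for the actual defaults.

{% raw %}
from recsys_slates_dataset import dataset_torch

# Illustrative call with the arguments from the table made explicit.
# The values below are assumptions for demonstration, not recommended settings.
ind2val, itemattr, dataloaders = dataset_torch.load_dataloaders(
    data_dir="dat",
    batch_size=1024,             # number of unique users per batch
    split_trainvalid=0.90,       # fraction of users dedicated to the training split
    t_testsplit=5,               # first interactions of valid/test users kept for training
    sample_uniform_action=False, # use the logged exposures, not all-item sampling
)
{% endraw %}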

Batches

The batches are split by userId and provide the necessary information for training. We will explain each element below:

{% raw %}
batch = next(iter(dataloaders['train']))
for key, val in batch.items():
    print(key, val.size())
userId torch.Size([1024])
click torch.Size([1024, 20])
click_idx torch.Size([1024, 20])
slate_lengths torch.Size([1024, 20])
slate torch.Size([1024, 20, 25])
interaction_type torch.Size([1024, 20])
mask_type torch.Size([1024, 20])
{% endraw %}
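Each of these fields is an ordinary pytorch tensor, so a batch can be moved to an accelerator before being fed to a model. A minimal sketch:

{% raw %}
# Sketch: move every tensor in a batch to the GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_on_device = {key: val.to(device) for key, val in batch.items()}
{% endraw %}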

Interaction data (data.npz)

The dataset consists of 2.2M unique users who have interacted up to 20 times with the internet platform and have been exposed to up to 25 items at each interaction. data.npz contains all the slate and click data, and the two main arrays are click and slate. The convention for the array dimensions is that the first dimension indexes users, the second indexes time (interaction number), and the third indexes position in the presented slate. The full description of all arrays is as follows:

| Name | Dimension | Description |
|---|---|---|
| slate | [userId, interaction num, slate pos] | The slates presented to the users |
| click | [userId, interaction num] | Items clicked by the users in each slate |
| interaction_type | [userId, interaction num] | Type of interaction the user had with the platform (search or recommendation) |
| click_idx | [userId, interaction num] | Auxiliary data: the position of the click in the slate array (integer from 0-24). Useful for e.g. categorical likelihoods |
| slate_lengths | [userId, interaction num] | Auxiliary data: the actual length of the slate. Equal to 25 minus the number of padding indices in the slate |

{% raw %}
dat = dataloaders['train'].dataset.data

# Print dimensions of all arrays:
for key, val in dat.items():
  print(f"{key} : \t {val.size()}")
userId : 	 torch.Size([2277645])
click : 	 torch.Size([2277645, 20])
click_idx : 	 torch.Size([2277645, 20])
slate_lengths : 	 torch.Size([2277645, 20])
slate : 	 torch.Size([2277645, 20, 25])
interaction_type : 	 torch.Size([2277645, 20])
mask_type : 	 torch.Size([2277645, 20])
{% endraw %}

Example: Get one interaction

Get the presented slate + click for user 5 at interaction number 3

{% raw %}
print("Slate:")
print(dat['slate'][5,3])
print(" ")
print("Click:")
print(dat['click'][5,3])
print("Type of interaction: (1 implies search, see ind2val file)")
print(dat['interaction_type'][5,3])
Slate:
tensor([     1, 638995, 638947, 638711, 637590, 637930, 638894,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0])
 
Click:
tensor(637590)
Type of interaction: (1 implies search, see ind2val file)
tensor(1)
{% endraw %}

From the above extraction we can see that user 5 at interaction number 3 was presented with a total of 7 items: 6 "real" items and the "no-click" item, which has index 1. The remaining positions in the array are padded with the index 0. The "no-click" item is always present in the slates, as the user has the alternative of not clicking on any of the presented items in the slate. Further, we see that the user clicked on the item in position 4 of the slate (zero-indexed). The slate length and the click position can be found in the following auxiliary arrays:

{% raw %}
print("Click_idx:")
print(dat['click_idx'][5,3])
print("Slate lengths:")
print(dat['slate_lengths'][5,3])
Click_idx:
tensor(4)
Slate lengths:
tensor(7)
{% endraw %}
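By construction, these auxiliary arrays agree with the raw slate array. A quick sanity check on the same interaction (a sketch that only uses the arrays described above):

{% raw %}
# Sketch: check the auxiliary arrays against the raw slate for user 5, interaction 3.
slate = dat['slate'][5, 3]
print((slate != 0).sum())             # non-padded entries; expected to equal slate_lengths[5, 3] (7)
print(slate[dat['click_idx'][5, 3]])  # item in the clicked position; expected to equal click[5, 3] (637590)
{% endraw %}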

Index to item (ind2val.json)

This file contains mappings from indices to values for the attributes category and interaction_type.

| Name | Length | Description |
|---|---|---|
| category | 290 | Mapping from the category index to a text string that describes the category and location of the group |
| interaction_type | 3 | Indices describing whether the presented slate originated from search or recommendations |

Example ind2val

We print out the first elements of each index. For example, we see that category 3 is "BAP,antiques,Trøndelag" which implies the category contains antiques sold in the county of Trøndelag.

{% raw %}
for key, val in ind2val.items():
  print(" ")
  print(f"{key} first entries:")
  for idx, name in val.items():
    print(f"{idx}: {val[idx]}")
    if idx >3:
      break
 
category first entries:
0: PAD
1: noClick
2: <UNK>
3: BAP,antiques,Trøndelag
4: MOTOR,,Sogn og Fjordane
 
interaction_type first entries:
1: search
2: rec
0: <UNK>
{% endraw %}
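The interaction_type mapping can also be used to translate the example interaction from earlier into a readable label; given the outputs above, this should print "search":

{% raw %}
# Sketch: translate the interaction_type of user 5, interaction 3 into its label.
print(ind2val['interaction_type'][dat['interaction_type'][5, 3].item()])
{% endraw %}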

Item attributes (itemattr.npz)

A numpy array that encodes the category of each item.

| Name | Dimension | Description |
|---|---|---|
| category | [itemId] | The group that each item belongs to |

{% raw %}
for key, val in itemattr.items():
  print(f"{key} : {val.shape}")

print("\nThe full dictionary:")
itemattr
category : (1311775,)

The full dictionary:
{'category': array([  0.,   1.,   2., ..., 289., 289., 289.])}
{% endraw %}

Example itemattr

Get the category of the clicked item above (from user 5, interaction number 3)

{% raw %}
print("Find the itemId that were click by user 5 in interaction 3:")
itemId = [dat['click'][5,3]]
print(f"itemId: {itemId}")

print("\nFind the category index of that item in itemattr:")
cat_idx = itemattr['category'][itemId]
print(f"Category index: {cat_idx}")

print("\nFinally, find the category name by using ind2val:")
cat_name = ind2val['category'][cat_idx.item()]
print(f"Category name: {cat_name}")
Find the itemId that was clicked by user 5 in interaction 3:
itemId: [tensor(637590)]

Find the category index of that item in itemattr:
Category index: [135.]

Finally, find the category name by using ind2val:
Category name: REAL_ESTATE,,Oppland
{% endraw %}
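The same lookup can be vectorized over all clicked items, for instance to see which categories receive the most clicks. The sketch below is only illustrative: it counts clicks across all splits and skips the padding index 0 and the no-click index 1.

{% raw %}
import numpy as np

# Sketch: category distribution of clicked items (skipping pad=0 and noClick=1).
clicks = dat['click'].flatten().numpy()
clicks = clicks[clicks > 1]

# Map each clicked itemId to its category index, then count and show the top 5.
click_cats = itemattr['category'][clicks].astype(int)
counts = np.bincount(click_cats)
for cat_idx in counts.argsort()[::-1][:5]:
    print(f"{ind2val['category'][int(cat_idx)]}: {counts[cat_idx]} clicks")
{% endraw %}

A few more summary statistics can be computed directly from the full arrays: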
{% raw %}
print(f"Ratio of no clicks: {(dat['click']==1).sum() / (dat['click']!=0).sum():.2f}")
print(f"Average slate length: {(dat['slate_lengths'][dat['slate_lengths']!=0]).float().mean():.2f}")
print(f"Ratio of slates that are recommendations: {(dat['interaction_type']==2).sum() / (dat['interaction_type']!=0).sum():.3f}")
print(f"Average number of interactions per user: {(dat['click']!=0).sum(-1).float().mean():.2f}")
Ratio of no clicks: 0.24
Average slate length: 11.14
Ratio of slates that are recommendations: 0.303
Average number of interactions per user: 16.43
{% endraw %}

Masking of train/test/val

Each batch is a dictionary of pytorch tensors and contains the usual data fields described above. In addition, it contains a mask_type tensor which indicates whether each click belongs to the train, valid or test split. It has the same dimensionality as the click tensor (num users x num interactions). This is because we want to return the full sequence of interactions, so that e.g. the test set can use the first clicks of a user (which belong to the training set) to build a user profile. The mask is defined in the following way:

mask2split = {
    0 : 'PAD',
    1 : 'train',
    2 : 'valid',
    3 : 'test'
}

If the mask equals zero, it means that the user sequence was shorter than this index. The modeler has to take care not to train on elements in the validation or test dataset. Typically this can be done by masking out all losses that do not originate from the training dataset:

{% raw %}
train_mask = (batch['mask_type']==1)
train_mask
tensor([[True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        ...,
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True]])
{% endraw %}

For example, for the second user in this batch (index 1), all 20 interactions are marked with 1 and therefore belong to the training set. For users in the validation or test splits, only their first t_testsplit interactions are marked as training and the rest as validation or test. We can extract the clicks that belong to the training set by using mask_type:

{% raw %}
print("Mask of user 2:")
print(batch['mask_type'][1,])
print(" ")
print("Clicks belonging to the training set:")
print(train_mask[1,])
print(" ")
print("Select only the clicks in training dataset:")
batch['click'][1,][train_mask[1,]]
Mask of user 2:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
 
Clicks belonging to the training set:
tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True])
 
Select only the clicks in training dataset:
tensor([ 246058,  522114,  688321,       1,       1,  492102,  342033, 1050842,
              1,       1,  878114, 1104893,  581533,       1, 1114863,  191381,
         493192,  736750,  693049,  493709])
{% endraw %}
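Putting this together, here is a minimal sketch of applying the mask when training a model. The per-interaction loss below is a hypothetical placeholder standing in for whatever your model produces; only the masking logic is the point.

{% raw %}
# Sketch: let only clicks from the training split contribute to the loss.
# loss_per_click is a hypothetical per-interaction loss of shape [batch_size, 20];
# in practice it would come from your model's likelihood of the observed clicks.
loss_per_click = torch.rand(batch['click'].size())

train_mask = (batch['mask_type'] == 1).float()
masked_loss = (loss_per_click * train_mask).sum() / train_mask.sum()
{% endraw %}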