---
title: Quick start with the FINN.no recsys slate dataset
keywords: fastai
sidebar: home_sidebar
nb_path: "examples/quickstart-finn-recsys-slate-data.ipynb"
---
```python
!pip install recsys_slates_dataset -q
```
It is possible to load the dataset directly as a PyTorch dataloader, with the same dataset splits etc. as in the original paper. Use the `load_dataloaders` function in the `dataset_torch` module. It has the following options:
Argument | Description |
---|---|
batch_size | Number of unique users sampled in each batch |
split_trainvalid | Ratio of full dataset dedicated to train (val/test is split evenly among the rest) |
t_testsplit | For users in valid and test, how many interactions should belong to training set |
sample_uniform_action | If this is True, the exposures in the dataset are sampled as in the all-item likelihood (see paper) |
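As a rough sketch of how `split_trainvalid` partitions the data (my own illustrative arithmetic, assuming the split is over users as suggested by the per-user batching; this is not the library's internal code):

```python
# Illustrative sketch: how split_trainvalid could partition the 2.2M users.
# These numbers are back-of-the-envelope arithmetic, not output from the library.
num_users = 2_200_000          # total unique users in the dataset
split_trainvalid = 0.9         # ratio of the full dataset dedicated to train

num_train = int(num_users * split_trainvalid)
num_rest = num_users - num_train
num_valid = num_rest // 2      # val/test split evenly among the rest
num_test = num_rest - num_valid

print(num_train, num_valid, num_test)  # 1980000 110000 110000
```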
The function returns `ind2val`, `itemattr` and a dictionary with PyTorch dataloaders for training, validation and test.
```python
import torch
from recsys_slates_dataset import dataset_torch

ind2val, itemattr, dataloaders = dataset_torch.load_dataloaders(data_dir="dat")

print("Dictionary containing the dataloaders:")
print(dataloaders)
```

```python
batch = next(iter(dataloaders['train']))
for key, val in batch.items():
    print(key, val.size())
```
## `data.npz`

The dataset consists of 2.2M unique users that have each interacted up to 20 times with the internet platform, and have been exposed to up to 25 items at each interaction.
`data.npz` contains all the slate and click data; the two main arrays are `click` and `slate`.
The convention for the array dimensions is that the first dimension indexes the user, the second indexes time (interaction number) and the third indexes the position in the presented slate. The full description of all arrays is as follows:
Name | Dimension | Description |
---|---|---|
slate | [userId, interaction num, slate pos] | The slates presented to the users |
click | [userId, interaction num] | Items clicked by the users in each slate |
interaction_type | [userId, interaction num] | Type of interaction the user had with the platform (search or recommendation) |
click_idx | [userId, interaction num] | Auxiliary data: the position of the click in the slate (integer from 0-24). Useful for e.g. categorical likelihoods |
slate_lengths | [userId, interaction num] | Auxiliary data: the actual length of the slate, equal to 25 minus the number of pad indices in the slate |
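To make the indexing convention concrete, here is a tiny synthetic example with the same axis layout (the sizes 20 and 25 come from the dataset description above; the contents and the user count of 3 are made up):

```python
import numpy as np

n_users, n_interactions, slate_len = 3, 20, 25

# slate: per user, per interaction, the presented items (0 = pad index)
slate = np.zeros((n_users, n_interactions, slate_len), dtype=np.int64)
# click: per user, per interaction, the clicked item (1 = the "no-click" item)
click = np.ones((n_users, n_interactions), dtype=np.int64)

print(slate.shape)  # (3, 20, 25)
print(click.shape)  # (3, 20)
```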
```python
dat = dataloaders['train'].dataset.data

# Print dimensions of all arrays:
for key, val in dat.items():
    print(f"{key} : \t {val.size()}")
```
```python
print("Slate:")
print(dat['slate'][5,3])
print(" ")
print("Click:")
print(dat['click'][5,3])
print("Type of interaction: (1 implies search, see ind2val file)")
print(dat['interaction_type'][5,3])
```
From the extraction above we can see that user 5 at interaction number 3 was presented with a total of 7 items: 6 "real" items and the "no-click" item that has index 1. The remaining positions in the array are padded with the index 0. The "no-click" item is always present in the slates, as the user has the alternative not to click on any of the presented items. Further, we see that the user clicked on the fourth item in the slate. The slate length and the click position can be found in the following auxiliary arrays:
```python
print("Click_idx:")
print(dat['click_idx'][5,3])
print("Slate lengths:")
print(dat['slate_lengths'][5,3])
```
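The auxiliary arrays are derivable from `slate` and `click`. A minimal sketch with a single synthetic slate, using the pad index 0 and no-click item 1 described above (the item ids are made up):

```python
import numpy as np

# One synthetic slate: the no-click item (1), five "real" items, rest padded with 0.
slate = np.array([1, 120, 57, 89, 301, 44] + [0] * 19)
click = 89  # the user clicked the item with index 89

slate_length = int((slate != 0).sum())           # count of non-pad positions
click_idx = int(np.where(slate == click)[0][0])  # position of the click in the slate

print(slate_length)  # 6
print(click_idx)     # 3
```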
## `ind2val.json`

This file contains the mapping from indices to values for the attributes `category` and `interaction_type`.
Name | Length | Description |
---|---|---|
category | 290 | Mapping from the category index to a text string that describes the category and location of the group |
interaction_type | 3 | Indices of whether the presented slate originated from search or recommendations |
```python
for key, val in ind2val.items():
    print(" ")
    print(f"{key} first entries:")
    for idx, name in val.items():
        print(f"{idx}: {name}")
        if idx > 3:
            break
```
The `itemattr` dictionary holds per-item attributes, indexed by itemId:

```python
for key, val in itemattr.items():
    print(f"{key} : {val.shape}")

print("\nThe full dictionary:")
itemattr
```
```python
print("Find the itemId that was clicked by user 5 in interaction 3:")
itemId = [dat['click'][5,3]]
print(f"itemId: {itemId}")

print("\nFind the category index of that item in itemattr:")
cat_idx = itemattr['category'][itemId]
print(f"Category index: {cat_idx}")

print("\nFinally, find the category name by using ind2val:")
cat_name = ind2val['category'][cat_idx.item()]
print(f"Category name: {cat_name}")
```
```python
print(f"Ratio of no clicks: {(dat['click']==1).sum() / (dat['click']!=0).sum():.2f}")
print(f"Average slate length: {(dat['slate_lengths'][dat['slate_lengths']!=0]).float().mean():.2f}")
print(f"Ratio of slates that are recommendations: {(dat['interaction_type']==2).sum() / (dat['interaction_type']!=0).sum():.3f}")
print(f"Average number of interactions per user: {(dat['click']!=0).sum(-1).float().mean():.2f}")
```
Each batch is a dictionary of PyTorch tensors containing the usual data fields described above.
In addition, it contains a `mask_type` tensor which indicates whether each click belongs to the train, valid or test split.
It has the same dimensionality as the click tensor (`num users * num interactions`).
This is because we want to return the full sequence of interactions, so that e.g. a model evaluated on the test set can use the first clicks of the user (which belong to the training set) to build a user profile.
The mask is defined in the following way:
```python
mask2split = {
    0 : 'PAD',
    1 : 'train',
    2 : 'valid',
    3 : 'test'
}
```
If the mask equals zero, the user's sequence was shorter than this index (i.e. the position is padding). The modeler has to take care not to train on elements in the validation or test datasets. Typically this is done by masking out all losses that do not originate from the training dataset:
```python
train_mask = (batch['mask_type']==1)
train_mask
```
For example, for user number 1 in this batch, the first five interactions belong to the training set and the remaining belong to the validation set. We can extract the clicks that belong to the training set by using `mask_type`:
```python
print("Mask of user 1:")
print(batch['mask_type'][1,])
print(" ")
print("Clicks belonging to the training set:")
print(train_mask[1,])
print(" ")
print("Select only the clicks in training dataset:")
batch['click'][1,][train_mask[1,]]
```
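Putting the mask to use in training, a minimal sketch of masking a per-click loss so that only training elements contribute to the gradient (the tensors and the loss values are synthetic and illustrative, not the paper's model):

```python
import torch

# Synthetic per-click losses and mask for 2 users x 5 interactions.
loss = torch.tensor([[0.5, 0.2, 0.1, 0.4, 0.3],
                     [0.6, 0.1, 0.2, 0.0, 0.0]])
mask_type = torch.tensor([[1, 1, 1, 2, 3],
                          [1, 1, 2, 0, 0]])  # 0=PAD, 1=train, 2=valid, 3=test

train_mask = (mask_type == 1)
# Average loss over training elements only: zero out valid/test/pad positions
# and normalize by the number of training elements.
train_loss = (loss * train_mask).sum() / train_mask.sum()
print(train_loss)  # mean of [0.5, 0.2, 0.1, 0.6, 0.1] = 0.3
```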