pydgn.data

data.dataset

class pydgn.data.dataset.ConcatFromListDataset(data_list: List[torch_geometric.data.data.Data])

Bases: torch_geometric.data.in_memory_dataset.InMemoryDataset

Creates a dataset from a list of torch_geometric.data.Data objects. Inherits from torch_geometric.data.InMemoryDataset.

Parameters

data_list (list) – List of graphs.

download()

Downloads the dataset to the self.raw_dir folder.

process()

Processes the dataset to the self.processed_dir folder.

property processed_file_names: Union[str, List[str], Tuple]

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_file_names: Union[str, List[str], Tuple]

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

class pydgn.data.dataset.DatasetInterface(root: str, name: str, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None)

Bases: torch_geometric.data.dataset.Dataset

Class that defines a number of properties essential to all dataset implementations inside PyDGN. These properties are used by the training engine and forwarded to the model to be trained. For some datasets, e.g., torch_geometric.datasets.TUDataset, implementing this interface is straightforward.

Parameters
  • root (str) – root folder where to store the dataset

  • name (str) – name of the dataset

  • transform (Optional[Callable]) – transformations to apply to each Data object at run time

  • pre_transform (Optional[Callable]) – transformations to apply to each Data object at dataset creation time

  • pre_filter (Optional[Callable]) – sample filtering to apply to each Data object at dataset creation time

property dim_edge_features

Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_node_features

Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_target

Specifies the dimension of each target vector.

download()

Downloads the dataset to the self.raw_dir folder.

get(idx: int) torch_geometric.data.data.Data

Gets the data object at index idx.

len() int

Returns the number of graphs stored in the dataset.

process()

Processes the dataset to the self.processed_dir folder.

property processed_file_names: Union[str, List[str], Tuple]

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_file_names: Union[str, List[str], Tuple]

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
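
The shape of this interface can be illustrated with a toy, framework-free stand-in. Everything below is hypothetical: ToyDatasetSketch and its dictionary-based graphs are assumptions made for illustration, and a real implementation would subclass DatasetInterface and return torch_geometric Data objects.

```python
from typing import Dict, List


class ToyDatasetSketch:
    """Hypothetical stand-in exposing the same members as DatasetInterface."""

    def __init__(self, graphs: List[Dict]):
        # Each graph is a plain dict with node features "x" and a target "y".
        self._graphs = graphs

    @property
    def dim_node_features(self) -> int:
        # Number of features per node, read from the first graph.
        return len(self._graphs[0]["x"][0])

    @property
    def dim_edge_features(self) -> int:
        # The toy graphs carry no edge features.
        return 0

    @property
    def dim_target(self) -> int:
        # One scalar target per graph.
        return 1

    def get(self, idx: int) -> Dict:
        return self._graphs[idx]

    def len(self) -> int:
        return len(self._graphs)


toy = ToyDatasetSketch([
    {"x": [[0.0, 1.0], [1.0, 0.0]], "y": 1},
    {"x": [[1.0, 1.0]], "y": 0},
])
```

A training engine could read toy.dim_node_features and toy.dim_target to size the input and output layers of a model before any graph is loaded.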

class pydgn.data.dataset.OGBGDatasetInterface(root, name, transform=None, pre_transform=None, pre_filter=None, meta_dict=None)

Bases: ogb.graphproppred.dataset_pyg.PygGraphPropPredDataset

Class that wraps the ogb.graphproppred.PygGraphPropPredDataset class to provide aliases of some fields. It implements the interface DatasetInterface but does not extend it directly, to avoid clashes between __init__ methods.

property dim_edge_features
property dim_node_features
property dim_target
download()

Downloads the dataset to the self.raw_dir folder.

get_idx_split(split_type: Optional[str] = None) dict
process()

Processes the dataset to the self.processed_dir folder.

property processed_file_names

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

class pydgn.data.dataset.PlanetoidDatasetInterface(root, name, transform=None, pre_transform=None, pre_filter=None, **kwargs)

Bases: torch_geometric.datasets.planetoid.Planetoid

Class that wraps the torch_geometric.datasets.Planetoid class to provide aliases of some fields. It implements the interface DatasetInterface but does not extend it directly, to avoid clashes between __init__ methods.

property dim_edge_features
property dim_node_features
property dim_target
download()

Downloads the dataset to the self.raw_dir folder.

process()

Processes the dataset to the self.processed_dir folder.

class pydgn.data.dataset.TUDatasetInterface(root, name, transform=None, pre_transform=None, pre_filter=None, **kwargs)

Bases: torch_geometric.datasets.tu_dataset.TUDataset

Class that wraps the torch_geometric.datasets.TUDataset class to provide aliases of some fields. It implements the interface DatasetInterface but does not extend it directly, to avoid clashes between __init__ methods.

property dim_edge_features
property dim_node_features
property dim_target
download()

Downloads the dataset to the self.raw_dir folder.

process()

Processes the dataset to the self.processed_dir folder.

class pydgn.data.dataset.ZipDataset(datasets: List[torch.utils.data.dataset.Dataset])

Bases: torch.utils.data.dataset.Dataset

This Dataset takes n datasets and “zips” them. When asked for the i-th element, it returns the i-th element of all n datasets.

Parameters

datasets (List[torch.utils.data.Dataset]) – An iterable with PyTorch Datasets

Precondition:

All datasets must have the same length.
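
The zipping behaviour and its precondition can be sketched without PyTorch. ZipSketch below is a hypothetical stand-in that works on any indexable sequences; it is not PyDGN's implementation, and raising ValueError on a length mismatch is an assumption.

```python
from typing import Any, List, Sequence, Tuple


class ZipSketch:
    """Hypothetical stand-in for ZipDataset, for illustration only."""

    def __init__(self, datasets: List[Sequence[Any]]):
        # Enforce the documented precondition: all datasets share one length.
        lengths = {len(d) for d in datasets}
        if len(lengths) > 1:
            raise ValueError(f"datasets have different lengths: {sorted(lengths)}")
        self.datasets = datasets

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        # The i-th element is the tuple of the i-th elements of every dataset.
        return tuple(d[index] for d in self.datasets)

    def __len__(self) -> int:
        return len(self.datasets[0])


zipped = ZipSketch([[10, 11, 12], ["a", "b", "c"]])
second = zipped[1]  # pairs the second item of each dataset
```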

data.provider

class pydgn.data.provider.DataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.dataloader.DataLoader], Callable[[...], torch_geometric.loader.dataloader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)

Bases: object

A DataProvider object retrieves the correct data according to the external and internal data splits. It can be additionally used to augment the data, or to create a specific type of data loader. The base class does nothing special, but here is where the i-th element of a dataset could be pre-processed before constructing the mini-batches.

Parameters
  • data_root (str) – the path of the root folder in which data is stored

  • splits_filepath (str) – the filepath of the splits, with additional metadata

  • dataset_class (Callable[…,:class:pydgn.data.dataset.DatasetInterface]) – the class of the dataset

  • data_loader_class (Union[Callable[…,:class:torch.utils.data.DataLoader], Callable[…,:class:torch_geometric.loader.DataLoader]]) – the class of the data loader to use

  • data_loader_args (dict) – the arguments of the data loader

  • dataset_name (str) – the name of the dataset

  • outer_folds (int) – the number of outer folds for risk assessment. 1 means hold-out, >1 means k-fold

  • inner_folds (int) – the number of inner folds for model selection. 1 means hold-out, >1 means k-fold

get_dim_edge_features() int

Returns the number of edge features of the dataset

Returns

the value of the property dim_edge_features in the dataset

get_dim_node_features() int

Returns the number of node features of the dataset

Returns

the value of the property dim_node_features in the dataset

get_dim_target() int

Returns the dimension of the target for the task

Returns

the value of the property dim_target in the dataset

get_inner_train(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]

Returns the training set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_inner_val(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]

Returns the validation set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_test(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]

Returns the test set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_train(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]

Returns the training set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_val(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]

Returns the validation set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

set_exp_seed(seed: int)

Sets the experiment seed to give to the DataLoader. Helps with reproducibility.

Parameters

seed (int) – id of the seed

set_inner_k(k)

Sets the parameter k of the model selection procedure. Called by the evaluation modules to load the correct subset of the data.

Parameters

k (int) – the id of the fold, ranging from 0 to K-1.

set_outer_k(k: int)

Sets the parameter k of the risk assessment procedure. Called by the evaluation modules to load the correct subset of the data.

Parameters

k (int) – the id of the fold, ranging from 0 to K-1.

class pydgn.data.provider.LinkPredictionSingleGraphDataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.dataloader.DataLoader], Callable[[...], torch_geometric.loader.dataloader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)

Bases: pydgn.data.provider.DataProvider

An extension of the DataProvider class to deal with link prediction on a single graph. Designed to work with LinkPredictionSingleGraphSplitter. We also assume the single-graph dataset can fit in memory. WARNING: this class modifies the dataset by creating copies. It may not work if a “shared dataset” feature is added to PyDGN.

get_inner_train(**kwargs)

Returns the training set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_inner_val(**kwargs)

Returns the validation set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_test(**kwargs)

Returns the test set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_train(**kwargs)

Returns the training set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_val(**kwargs)

Returns the validation set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

pydgn.data.provider.seed_worker(exp_seed, worker_id)

Used to set a different, but reproducible, seed for all data-retriever workers. Without this, all workers will retrieve the data in the same order.

Parameters
  • exp_seed (int) – base seed to be used for reproducibility

  • worker_id (int) – id number of the worker
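
One common way to give each worker a different but reproducible seed is to offset the base seed by the worker id. The sketch below assumes that scheme and seeds only Python's random module; the actual function may combine the two values differently and may seed other libraries as well. The names seed_worker_sketch and first_draws are hypothetical.

```python
import random
from typing import List


def seed_worker_sketch(exp_seed: int, worker_id: int) -> int:
    # Assumption: the per-worker seed is the experiment seed plus the worker id.
    worker_seed = exp_seed + worker_id
    random.seed(worker_seed)
    return worker_seed


def first_draws(exp_seed: int, worker_id: int, n: int = 3) -> List[float]:
    # Seed as the given worker would, then draw a few numbers.
    seed_worker_sketch(exp_seed, worker_id)
    return [random.random() for _ in range(n)]
```

Since the worker_init_fn hook of a PyTorch DataLoader is called with the worker id only, a two-argument function like this one would typically have exp_seed bound first, for example with functools.partial.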

data.sampler

class pydgn.data.sampler.RandomSampler(data_source: pydgn.data.dataset.DatasetInterface)

Bases: torch.utils.data.sampler.RandomSampler

This sampler wraps the dataset and saves the random permutation applied to the samples, so that it will be available for further use (e.g. for saving graph embeddings in the original order). The permutation is saved in the ‘permutation’ attribute.

Parameters

data_source (pydgn.data.DatasetInterface) – the dataset object

data_source: Sized
replacement: bool
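
To see why storing the permutation is useful, consider the following framework-free sketch. PermutationSamplerSketch and restore_order are hypothetical names; the sketch only assumes that the sampler yields a shuffled list of indices and keeps it in a permutation attribute.

```python
import random
from typing import Any, Iterator, List, Optional, Sequence


class PermutationSamplerSketch:
    """Hypothetical sampler that remembers the last permutation it produced."""

    def __init__(self, data_source: Sequence[Any], seed: Optional[int] = None):
        self.data_source = data_source
        self.permutation: Optional[List[int]] = None
        self._rng = random.Random(seed)

    def __iter__(self) -> Iterator[int]:
        indices = list(range(len(self.data_source)))
        self._rng.shuffle(indices)
        self.permutation = indices  # saved so the order can be undone later
        return iter(indices)

    def __len__(self) -> int:
        return len(self.data_source)


def restore_order(outputs: List[Any], permutation: List[int]) -> List[Any]:
    # outputs[j] was computed for sample permutation[j]; put it back in place.
    restored: List[Any] = [None] * len(outputs)
    for position, sample_idx in enumerate(permutation):
        restored[sample_idx] = outputs[position]
    return restored


data = ["g0", "g1", "g2", "g3", "g4"]
sampler = PermutationSamplerSketch(data, seed=0)
shuffled_outputs = [data[i].upper() for i in sampler]  # e.g. embeddings
original_order = restore_order(shuffled_outputs, sampler.permutation)
```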

data.splitter

class pydgn.data.splitter.Fold(train_idxs, val_idxs=None, test_idxs=None)

Bases: object

Simple class that stores training, validation, and test indices.

Parameters
  • train_idxs (Union[list, tuple]) – training indices

  • val_idxs (Union[list, tuple]) – validation indices. Default is None

  • test_idxs (Union[list, tuple]) – test indices. Default is None

class pydgn.data.splitter.InnerFold(train_idxs, val_idxs=None, test_idxs=None)

Bases: pydgn.data.splitter.Fold

Simple extension of the Fold class that returns a dictionary with training and validation indices (model selection).

todict() dict

Creates a dictionary with the training/validation indices.

Returns

a dict with keys ['train', 'val'] associated with the respective indices

class pydgn.data.splitter.LinkPredictionSingleGraphSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1, undirected: bool = False, avoid_opposite_negative_edges: bool = True, run_checks=False)

Bases: pydgn.data.splitter.Splitter

Class that inherits from Splitter and computes link splits for link classification tasks. IMPORTANT: This class implements bootstrapping rather than k-fold cross-validation, so different outer test sets may have overlapping indices.

Does not support edge attributes at the moment.

Parameters
  • n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold

  • n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold

  • seed (int) – random seed for reproducibility (on the same machine)

  • stratify (bool) – whether to apply stratification or not (should be true for classification tasks)

  • shuffle (bool) – whether to apply shuffle or not

  • inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is 0.1

  • outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is 0.1

  • test_ratio (float) – percentage of test set for hold_out model assessment. Default is 0.1

  • undirected (bool) – whether or not the graph is undirected. Default is False

  • avoid_opposite_negative_edges (bool) – whether or not to avoid creating negative edges that are opposite to existing edges. Default is True

  • run_checks (bool) – whether or not to run correctness checks. Creates a full adjacency matrix, can be memory intensive.

split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)

Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds. Links are selected at random: this means outer test folds will almost surely overlap if test_ratio is 10% of the total samples. The recommended procedure here is to use the outer folds to do bootstrapping rather than k-fold cross-validation. Idea taken from: https://arxiv.org/pdf/1811.05868.pdf

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None

train_val_test_edge_split(edge_index, edge_attr, val_ratio, test_ratio, num_nodes)

Sample training/validation/test edges at random.

class pydgn.data.splitter.NoShuffleTrainTestSplit(test_ratio)

Bases: object

Class that implements a very simple training/test split. Can be used to further split training data into training and validation.

Parameters

test_ratio – percentage of data to use for evaluation.

split(idxs, y=None)
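
A split without shuffling can be sketched as a simple cut of the index list. The function below is hypothetical: it assumes the leading indices go to training and the trailing test_ratio fraction goes to evaluation, and the return format of the real split method may differ.

```python
from typing import List, Sequence, Tuple


def no_shuffle_split(idxs: Sequence[int], test_ratio: float) -> Tuple[List[int], List[int]]:
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be in [0, 1]")
    # Assumption: the test set is the final block of indices, in original order.
    n_test = int(len(idxs) * test_ratio)
    n_train = len(idxs) - n_test
    return list(idxs[:n_train]), list(idxs[n_train:])


train_idxs, test_idxs = no_shuffle_split(list(range(10)), test_ratio=0.2)
```

Applying the same function again to train_idxs gives a training/validation split of the training data.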
class pydgn.data.splitter.OGBGSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: pydgn.data.splitter.Splitter

split(dataset: pydgn.data.dataset.OGBGDatasetInterface, targets=None)

Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds.

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None

class pydgn.data.splitter.OuterFold(train_idxs, val_idxs=None, test_idxs=None)

Bases: pydgn.data.splitter.Fold

Simple extension of the Fold class that returns a dictionary with training, validation, and test indices (risk assessment).

todict() dict

Creates a dictionary with the training/validation/test indices.

Returns

a dict with keys ['train', 'val', 'test'] associated with the respective indices
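
The relationship between Fold, InnerFold, and OuterFold can be sketched in a few lines. The classes below are hypothetical re-creations based only on the signatures and dictionary keys documented here, not the library's source.

```python
class FoldSketch:
    """Stores training, validation, and test indices."""

    def __init__(self, train_idxs, val_idxs=None, test_idxs=None):
        self.train_idxs = train_idxs
        self.val_idxs = val_idxs
        self.test_idxs = test_idxs


class InnerFoldSketch(FoldSketch):
    def todict(self) -> dict:
        # Model selection only needs training and validation indices.
        return {"train": self.train_idxs, "val": self.val_idxs}


class OuterFoldSketch(FoldSketch):
    def todict(self) -> dict:
        # Risk assessment also needs the test indices.
        return {
            "train": self.train_idxs,
            "val": self.val_idxs,
            "test": self.test_idxs,
        }


inner = InnerFoldSketch([0, 1, 2], val_idxs=[3]).todict()
outer = OuterFoldSketch([0, 1, 2], val_idxs=[3], test_idxs=[4, 5]).todict()
```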

class pydgn.data.splitter.Splitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: object

Class that generates and stores the data splits at dataset creation time.

Parameters
  • n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold

  • n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold

  • seed (int) – random seed for reproducibility (on the same machine)

  • stratify (bool) – whether to apply stratification or not (should be true for classification tasks)

  • shuffle (bool) – whether to apply shuffle or not

  • inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is 0.1

  • outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is 0.1

  • test_ratio (float) – percentage of test set for hold_out model assessment. Default is 0.1

get_graph_targets(dataset: pydgn.data.dataset.DatasetInterface) -> (<class 'bool'>, <class 'numpy.ndarray'>)

Reads the entire dataset and returns the targets.

Parameters

dataset (DatasetInterface) – the dataset

Returns

a tuple of two elements. The first element is a boolean, which is True if target values exist and could be read without an exception being thrown. The second element holds the actual targets if the boolean is True, and None otherwise.

classmethod load(path: str)

Loads the data splits from disk.

Parameters

path (str) – the path of the file with the splits

Returns

a Splitter object

save(path: str)

Saves the split as a dictionary into a torch file. The keys of the dictionary are:

  • seed (int)

  • splitter_class (str)

  • splitter_args (dict)

  • outer_folds (list of dicts)

  • inner_folds (list of lists of dicts)

Parameters

path (str) – filepath where to save the object
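
As an illustration of the saved structure, the dictionary for two outer folds with one inner fold each might look like the sketch below. The concrete values, the format of the splitter_class string, and the helper select_inner_fold are assumptions; only the five top-level keys and the 'train'/'val'/'test' keys come from this page.

```python
splits_sketch = {
    "seed": 42,
    "splitter_class": "pydgn.data.splitter.Splitter",  # assumed string format
    "splitter_args": {"n_outer_folds": 2, "n_inner_folds": 1},
    # One dict per outer fold (risk assessment).
    "outer_folds": [
        {"train": [2, 3, 4], "val": [5], "test": [0, 1]},
        {"train": [0, 1, 2], "val": [3], "test": [4, 5]},
    ],
    # One list per outer fold, holding one dict per inner fold (model selection).
    "inner_folds": [
        [{"train": [2, 3, 4], "val": [5]}],
        [{"train": [0, 1, 2], "val": [3]}],
    ],
}


def select_inner_fold(splits: dict, outer_k: int, inner_k: int) -> dict:
    # Inner folds are nested: first pick the outer fold, then the inner one.
    return splits["inner_folds"][outer_k][inner_k]


chosen = select_inner_fold(splits_sketch, outer_k=1, inner_k=0)
```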

split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)

Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds.

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None
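
The idea of nested splits (k-fold outer folds for risk assessment, hold-out inner folds for model selection) can be sketched with the standard library alone. This is a generic illustration under simplifying assumptions (no stratification, a single hold-out inner fold, hypothetical function names) and not the algorithm used by Splitter.

```python
import random
from typing import Dict, List, Tuple


def kfold_test_folds(n_samples: int, n_folds: int, seed: int) -> List[List[int]]:
    # Shuffle once, then deal the indices into n_folds disjoint test folds.
    idxs = list(range(n_samples))
    random.Random(seed).shuffle(idxs)
    return [idxs[i::n_folds] for i in range(n_folds)]


def nested_split_sketch(
    n_samples: int,
    n_outer_folds: int,
    outer_val_ratio: float = 0.1,
    inner_val_ratio: float = 0.1,
    seed: int = 0,
) -> Tuple[List[Dict], List[List[Dict]]]:
    outer_folds, inner_folds = [], []
    test_folds = kfold_test_folds(n_samples, n_outer_folds, seed)
    for k, test in enumerate(test_folds):
        # Everything outside the k-th test fold is available for training.
        rest = [i for j, fold in enumerate(test_folds) if j != k for i in fold]
        n_outer_val = int(len(rest) * outer_val_ratio)
        outer_folds.append(
            {"train": rest[n_outer_val:], "val": rest[:n_outer_val], "test": test}
        )
        # Hold-out model selection: one inner fold carved from the same data.
        n_inner_val = int(len(rest) * inner_val_ratio)
        inner_folds.append(
            [{"train": rest[n_inner_val:], "val": rest[:n_inner_val]}]
        )
    return outer_folds, inner_folds
```

The key property is that inner folds never touch the test indices of their outer fold, so model selection cannot leak information into the risk estimate.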

pydgn.data.splitter.to_lower_triangular(edge_index: torch.Tensor)

Transforms a PyTorch Geometric undirected edge index into its “lower triangular counterpart”.
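
In an undirected edge index every edge appears in both directions, so keeping only the entries below the diagonal of the adjacency matrix retains each edge once. The sketch below uses plain lists instead of tensors and assumes a strict comparison (source index greater than target index, so self-loops are dropped); the actual function may treat the diagonal differently.

```python
from typing import List


def to_lower_triangular_sketch(edge_index: List[List[int]]) -> List[List[int]]:
    rows, cols = edge_index
    # Assumption: keep (row, col) only when it lies strictly below the diagonal.
    kept = [(r, c) for r, c in zip(rows, cols) if r > c]
    return [[r for r, _ in kept], [c for _, c in kept]]


# Undirected path 0 - 1 - 2, stored with both directions of each edge.
undirected = [[0, 1, 1, 2], [1, 0, 2, 1]]
lower = to_lower_triangular_sketch(undirected)
```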

data.transform

class pydgn.data.transform.ConstantEdgeIfEmpty(value=1)

Bases: object

Adds a constant value to each edge feature only if edge_attr is None.

Parameters

value (int) – The value to add. Default is 1

class pydgn.data.transform.ConstantIfEmpty(value=1)

Bases: object

Adds a constant value to each node feature only if x is None.

Parameters

value (int) – The value to add. Default is 1
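
The "only if empty" behaviour can be sketched with a tiny stand-in for a graph object. GraphSketch and ConstantIfEmptySketch are hypothetical, and giving every node a single-element feature vector is an assumption about the shape of the result.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GraphSketch:
    """Hypothetical minimal graph: a node count and optional node features."""
    num_nodes: int
    x: Optional[List[List[float]]] = None


class ConstantIfEmptySketch:
    def __init__(self, value: int = 1):
        self.value = value

    def __call__(self, data: GraphSketch) -> GraphSketch:
        # Existing features are left untouched; only missing ones are filled.
        if data.x is None:
            data.x = [[float(self.value)] for _ in range(data.num_nodes)]
        return data


transform = ConstantIfEmptySketch(value=1)
filled = transform(GraphSketch(num_nodes=3))
untouched = transform(GraphSketch(num_nodes=2, x=[[0.5], [0.7]]))
```

The same pattern would apply to edge features, with edge_attr and the number of edges in place of x and the number of nodes.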

class pydgn.data.transform.Degree(in_degree: bool = False, cat: bool = True)

Bases: object

Adds the node degree to the node features.

Parameters
  • in_degree (bool) – If set to True, will compute the in-degree of nodes instead of the out-degree. Not relevant if the graph is undirected. (default: False)

  • cat (bool) – Concat node degrees to node features instead of replacing them. (default: True)
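
The effect of the two options can be illustrated on a small directed graph using plain lists. The function degree_features is hypothetical and assumes that edge i goes from edge_index[0][i] to edge_index[1][i]; it is a sketch of the idea rather than the transform's code.

```python
from typing import List, Optional


def degree_features(
    edge_index: List[List[int]],
    num_nodes: int,
    x: Optional[List[List[float]]] = None,
    in_degree: bool = False,
    cat: bool = True,
) -> List[List[float]]:
    sources, targets = edge_index
    # Out-degree counts how often a node is a source, in-degree a target.
    endpoints = targets if in_degree else sources
    deg = [0.0] * num_nodes
    for node in endpoints:
        deg[node] += 1.0
    if x is not None and cat:
        # Append the degree as one extra feature per node.
        return [list(feats) + [d] for feats, d in zip(x, deg)]
    # Otherwise the degree becomes the only node feature.
    return [[d] for d in deg]


edges = [[0, 0, 1], [1, 2, 2]]  # 0->1, 0->2, 1->2
out_deg = degree_features(edges, num_nodes=3)
in_deg = degree_features(edges, num_nodes=3, in_degree=True)
concatenated = degree_features(edges, 3, x=[[5.0], [6.0], [7.0]])
```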

data.util

pydgn.data.util.check_argument(cls: object, arg_name: str) bool

Checks whether arg_name is in the signature of a method or class.

Parameters
  • cls (object) – the class to inspect

  • arg_name (str) – the name to look for

Returns

True if the name was found, False otherwise
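
A check of this kind can be written with the standard inspect module. The sketch below is one possible implementation and is not taken from the library; ExampleLoader is a hypothetical class used only to exercise it.

```python
import inspect


def check_argument_sketch(cls: object, arg_name: str) -> bool:
    # For a class, inspect.signature reports the parameters of its constructor.
    return arg_name in inspect.signature(cls).parameters


class ExampleLoader:
    """Hypothetical class used only to demonstrate the check."""

    def __init__(self, batch_size: int, shuffle: bool = False):
        self.batch_size = batch_size
        self.shuffle = shuffle


has_shuffle = check_argument_sketch(ExampleLoader, "shuffle")
has_seed = check_argument_sketch(ExampleLoader, "seed")
```

A check like this lets calling code pass an optional argument only to classes whose constructor accepts it.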

pydgn.data.util.filter_adj(edge_index: torch.Tensor, edge_attr: torch.Tensor, mask: torch.Tensor) -> (<class 'torch.Tensor'>, typing.Union[torch.Tensor, NoneType])

Adapted from https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/utils/dropout.html. Does the same thing but with a different signature.

Parameters
  • edge_index (torch.Tensor) – the usual PyG matrix of edge indices

  • edge_attr (torch.Tensor) – the usual PyG matrix of edge attributes

  • mask (torch.Tensor) – boolean tensor with edges to filter

Returns

a tuple (filtered edge index, filtered edge attr or None if edge_attr is None)
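
The filtering can be sketched with plain lists in place of tensors. The sketch assumes that a True entry in the mask means the edge is kept, which matches boolean-mask indexing, and filter_adj_sketch is a hypothetical name.

```python
from typing import List, Optional, Tuple


def filter_adj_sketch(
    edge_index: List[List[int]],
    edge_attr: Optional[List[List[float]]],
    mask: List[bool],
) -> Tuple[List[List[int]], Optional[List[List[float]]]]:
    sources, targets = edge_index
    # Assumption: True in the mask means "keep this edge".
    keep = [i for i, flag in enumerate(mask) if flag]
    filtered_index = [[sources[i] for i in keep], [targets[i] for i in keep]]
    # Edge attributes, if present, are filtered with the same positions.
    filtered_attr = None if edge_attr is None else [edge_attr[i] for i in keep]
    return filtered_index, filtered_attr


index, attr = filter_adj_sketch(
    [[0, 1, 2], [1, 2, 0]], [[0.1], [0.2], [0.3]], [True, False, True]
)
index_only, no_attr = filter_adj_sketch([[0, 1], [1, 0]], None, [False, True])
```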

pydgn.data.util.get_or_create_dir(path: str) str

Creates the directories in the specified path if they are missing, and returns the path string.

Parameters

path (str) – the path

Returns

the same path as the given argument
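
Equivalent behaviour can be obtained with os.makedirs. The sketch below is an assumption about how such a helper might be written, shown inside a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile


def get_or_create_dir_sketch(path: str) -> str:
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "DATA", "processed")
    returned = get_or_create_dir_sketch(target)
    created = os.path.isdir(target)
    # Calling it a second time must not fail.
    returned_again = get_or_create_dir_sketch(target)
```

Returning the path allows the call to be used inline, for example as an argument to os.path.join.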

pydgn.data.util.load_dataset(data_root: str, dataset_name: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface]) pydgn.data.dataset.DatasetInterface

Loads the dataset using the dataset_kwargs.pt file created when parsing the data config file.

Parameters
  • data_root (str) – path of the folder that contains the dataset folder

  • dataset_name (str) – name of the dataset (same as the name of the dataset folder that has been already created)

  • dataset_class (Callable[…, DatasetInterface]) – the class of the dataset to instantiate with the parameters stored in the dataset_kwargs.pt file.

Returns

a DatasetInterface object

pydgn.data.util.preprocess_data(options: dict)

One of the main functions of the PyDGN library. Used to create the dataset and its associated files that ensure the correct functioning of the data loading steps.

Parameters

options (dict) – a dictionary of dataset/splitter arguments as defined in the data configuration file used.