pydgn.data
data.dataset
- class pydgn.data.dataset.ConcatFromListDataset(data_list: List[torch_geometric.data.data.Data])
Bases:
torch_geometric.data.in_memory_dataset.InMemoryDataset
Create a dataset from a list of
torch_geometric.data.Data
objects. Inherits from torch_geometric.data.InMemoryDataset.
- Parameters
data_list (list) – List of graphs.
- download()
Downloads the dataset to the
self.raw_dir
folder.
- process()
Processes the dataset to the
self.processed_dir
folder.
- property processed_file_names: Union[str, List[str], Tuple]
The name of the files in the
self.processed_dir
folder that must be present in order to skip processing.
- property raw_file_names: Union[str, List[str], Tuple]
The name of the files in the
self.raw_dir
folder that must be present in order to skip downloading.
- class pydgn.data.dataset.DatasetInterface(root: str, name: str, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None)
Bases:
torch_geometric.data.dataset.Dataset
Class that defines a number of properties essential to all dataset implementations inside PyDGN. These properties are used by the training engine and forwarded to the model to be trained. For some datasets, e.g.,
torch_geometric.datasets.TUDataset
, implementing this interface is straightforward.
- Parameters
root (str) – root folder where to store the dataset
name (str) – name of the dataset
transform (Optional[Callable]) – transformations to apply to each
Data
object at run time
pre_transform (Optional[Callable]) – transformations to apply to each
Data
object at dataset creation time
pre_filter (Optional[Callable]) – sample filtering to apply to each
Data
object at dataset creation time
- property dim_edge_features
Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).
- property dim_node_features
Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).
- property dim_target
Specifies the dimension of each target vector.
- download()
Downloads the dataset to the
self.raw_dir
folder.
- get(idx: int) torch_geometric.data.data.Data
Gets the data object at index
idx
.
- len() int
Returns the number of graphs stored in the dataset.
- process()
Processes the dataset to the
self.processed_dir
folder.
- property processed_file_names: Union[str, List[str], Tuple]
The name of the files in the
self.processed_dir
folder that must be present in order to skip processing.
- property raw_file_names: Union[str, List[str], Tuple]
The name of the files in the
self.raw_dir
folder that must be present in order to skip downloading.
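As a rough illustration of the contract described above, the following sketch uses a plain Python class (not a real torch_geometric dataset) that exposes the three dimension properties together with get and len. The class name ToyGraphDataset and the dictionary layout of each graph are assumptions made for this example only.

```python
class ToyGraphDataset:
    """Hypothetical stand-in that mimics the properties listed above."""

    def __init__(self, graphs):
        # Each graph is a dict with node features 'x', edge features
        # 'edge_attr' (or None) and a target vector 'y'.
        self.graphs = list(graphs)

    @property
    def dim_node_features(self):
        # Width of the node feature matrix of the first graph.
        return len(self.graphs[0]["x"][0])

    @property
    def dim_edge_features(self):
        edge_attr = self.graphs[0]["edge_attr"]
        return 0 if edge_attr is None else len(edge_attr[0])

    @property
    def dim_target(self):
        return len(self.graphs[0]["y"])

    def get(self, idx):
        return self.graphs[idx]

    def len(self):
        return len(self.graphs)


toy = ToyGraphDataset([
    {"x": [[1.0, 0.0], [0.0, 1.0]], "edge_attr": None, "y": [1]},
    {"x": [[0.5, 0.5]], "edge_attr": None, "y": [0]},
])
```

A real implementation would subclass the torch_geometric dataset class and read these values from the processed data rather than from Python lists.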
- class pydgn.data.dataset.OGBGDatasetInterface(root, name, transform=None, pre_transform=None, pre_filter=None, meta_dict=None)
Bases:
ogb.graphproppred.dataset_pyg.PygGraphPropPredDataset
Class that wraps the
ogb.graphproppred.PygGraphPropPredDataset
class to provide aliases of some fields. It implements the interface
DatasetInterface
but does not extend it directly, to avoid clashes of
__init__
methods.
- property dim_edge_features
- property dim_node_features
- property dim_target
- download()
Downloads the dataset to the
self.raw_dir
folder.
- get_idx_split(split_type: Optional[str] = None) dict
- process()
Processes the dataset to the
self.processed_dir
folder.
- property processed_file_names
The name of the files in the
self.processed_dir
folder that must be present in order to skip processing.
- class pydgn.data.dataset.PlanetoidDatasetInterface(root, name, transform=None, pre_transform=None, pre_filter=None, **kwargs)
Bases:
torch_geometric.datasets.planetoid.Planetoid
Class that wraps the
torch_geometric.datasets.Planetoid
class to provide aliases of some fields. It implements the interface
DatasetInterface
but does not extend it directly, to avoid clashes of
__init__
methods.
- property dim_edge_features
- property dim_node_features
- property dim_target
- download()
Downloads the dataset to the
self.raw_dir
folder.
- process()
Processes the dataset to the
self.processed_dir
folder.
- class pydgn.data.dataset.TUDatasetInterface(root, name, transform=None, pre_transform=None, pre_filter=None, **kwargs)
Bases:
torch_geometric.datasets.tu_dataset.TUDataset
Class that wraps the
torch_geometric.datasets.TUDataset
class to provide aliases of some fields. It implements the interface
DatasetInterface
but does not extend it directly, to avoid clashes of
__init__
methods.
- property dim_edge_features
- property dim_node_features
- property dim_target
- download()
Downloads the dataset to the
self.raw_dir
folder.
- process()
Processes the dataset to the
self.processed_dir
folder.
- class pydgn.data.dataset.ZipDataset(datasets: List[torch.utils.data.dataset.Dataset])
Bases:
torch.utils.data.dataset.Dataset
This Dataset takes n datasets and “zips” them. When asked for the i-th element, it returns the i-th element of all n datasets.
- Parameters
datasets (List[torch.utils.data.Dataset]) – An iterable with PyTorch Datasets
- Precondition:
The length of all datasets must be the same
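The zipping behaviour and its precondition can be sketched in pure Python as follows. ZipSketch is a hypothetical name, plain lists stand in for PyTorch datasets, and returning a tuple is an assumption; the actual class may use a different container.

```python
class ZipSketch:
    """Hypothetical pure-Python analogue of a zipped dataset."""

    def __init__(self, datasets):
        datasets = list(datasets)
        lengths = {len(d) for d in datasets}
        # Enforce the precondition: all datasets share one length.
        if len(lengths) > 1:
            raise ValueError("all datasets must have the same length")
        self.datasets = datasets

    def __len__(self):
        return len(self.datasets[0]) if self.datasets else 0

    def __getitem__(self, i):
        # The i-th element of every dataset, in order.
        return tuple(d[i] for d in self.datasets)


zipped = ZipSketch([["g0", "g1", "g2"], [10, 11, 12]])
second = zipped[1]  # pairs "g1" with 11
```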
data.provider
- class pydgn.data.provider.DataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.dataloader.DataLoader], Callable[[...], torch_geometric.loader.dataloader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)
Bases:
object
A DataProvider object retrieves the correct data according to the external and internal data splits. It can be additionally used to augment the data, or to create a specific type of data loader. The base class does nothing special, but here is where the i-th element of a dataset could be pre-processed before constructing the mini-batches.
- Parameters
data_root (str) – the path of the root folder in which data is stored
splits_filepath (str) – the filepath of the splits, with additional metadata
dataset_class (Callable[…, pydgn.data.dataset.DatasetInterface]) – the class of the dataset
data_loader_class (Union[Callable[…, torch.utils.data.DataLoader], Callable[…, torch_geometric.loader.DataLoader]]) – the class of the data loader to use
data_loader_args (dict) – the arguments of the data loader
dataset_name (str) – the name of the dataset
outer_folds (int) – the number of outer folds for risk assessment. 1 means hold-out, >1 means k-fold
inner_folds (int) – the number of inner folds for model selection. 1 means hold-out, >1 means k-fold
- get_dim_edge_features() int
Returns the number of edge features of the dataset
- Returns
the value of the property
dim_edge_features
in the dataset
- get_dim_node_features() int
Returns the number of node features of the dataset
- Returns
the value of the property
dim_node_features
in the dataset
- get_dim_target() int
Returns the dimension of the target for the task
- Returns
the value of the property
dim_target
in the dataset
- get_inner_train(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]
Returns the training set for model selection associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_inner_val(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]
Returns the validation set for model selection associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_test(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]
Returns the test set for risk assessment associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_train(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]
Returns the training set for risk assessment associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_val(**kwargs: dict) Union[torch.utils.data.dataloader.DataLoader, torch_geometric.loader.dataloader.DataLoader]
Returns the validation set for risk assessment associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- set_exp_seed(seed: int)
Sets the experiment seed to give to the DataLoader. Helps with reproducibility.
- Parameters
seed (int) – id of the seed
- set_inner_k(k)
Sets the parameter k of the model selection procedure. Called by the evaluation modules to load the correct subset of the data.
- Parameters
k (int) – the id of the fold, ranging from 0 to K-1.
- set_outer_k(k: int)
Sets the parameter k of the risk assessment procedure. Called by the evaluation modules to load the correct subset of the data.
- Parameters
k (int) – the id of the fold, ranging from 0 to K-1.
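To make the role of the outer and inner fold ids concrete, the sketch below shows how a pair (outer k, inner k) could select indices from a nested split structure. The dictionary layout (a list of outer-fold dictionaries and a list of lists of inner-fold dictionaries), the helper name select_indices, and the index values are assumptions for illustration, not the provider's actual internals.

```python
# Hypothetical splits for 2 outer folds and 2 inner folds over 8 samples.
splits = {
    "outer_folds": [
        {"train": [0, 1, 2, 3, 4], "val": [5], "test": [6, 7]},
        {"train": [2, 3, 4, 6, 7], "val": [5], "test": [0, 1]},
    ],
    "inner_folds": [
        [{"train": [0, 1, 2], "val": [3, 4]}, {"train": [2, 3, 4], "val": [0, 1]}],
        [{"train": [2, 3, 4], "val": [6, 7]}, {"train": [4, 6, 7], "val": [2, 3]}],
    ],
}


def select_indices(splits, outer_k, inner_k=None, subset="train"):
    """Return the indices for an outer fold (and an optional inner fold)."""
    if inner_k is None:
        # Risk assessment: train/val/test of the outer fold.
        return splits["outer_folds"][outer_k][subset]
    # Model selection: train/val of an inner fold nested in the outer fold.
    return splits["inner_folds"][outer_k][inner_k][subset]


outer_test = select_indices(splits, outer_k=1, subset="test")
inner_val = select_indices(splits, outer_k=0, inner_k=1, subset="val")
```

In this reading, set_outer_k and set_inner_k only record which fold is active, and the get_* methods build a data loader over the corresponding subset.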
- class pydgn.data.provider.LinkPredictionSingleGraphDataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.dataloader.DataLoader], Callable[[...], torch_geometric.loader.dataloader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)
Bases:
pydgn.data.provider.DataProvider
An extension of the DataProvider class to deal with link prediction on a single graph. Designed to work with
LinkPredictionSingleGraphSplitter
. We also assume the single-graph dataset can fit in memory. WARNING: this class modifies the dataset by creating copies. It may not work if a “shared dataset” feature is added to PyDGN.
- get_inner_train(**kwargs)
Returns the training set for model selection associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_inner_val(**kwargs)
Returns the validation set for model selection associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_test(**kwargs)
Returns the test set for risk assessment associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_train(**kwargs)
Returns the training set for risk assessment associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_val(**kwargs)
Returns the validation set for risk assessment associated with specific outer and inner folds
- Parameters
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- pydgn.data.provider.seed_worker(exp_seed, worker_id)
Used to set a different, but reproducible, seed for all data-retriever workers. Without this, all workers will retrieve the data in the same order.
- Parameters
exp_seed (int) – base seed to be used for reproducibility
worker_id (int) – id number of the worker
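The idea of a different but reproducible seed per worker can be sketched with Python's standard random module. Deriving the seed as exp_seed + worker_id is an assumption of this example, and the function names are hypothetical; the library function may also seed other random number generators.

```python
import random


def seed_worker_sketch(exp_seed, worker_id):
    """Seed Python's RNG with a value that differs per worker."""
    # Assumption: the per-worker seed is simply exp_seed + worker_id.
    random.seed(exp_seed + worker_id)


def first_draws(exp_seed, worker_id, n=3):
    """Return the first n random numbers a given worker would draw."""
    seed_worker_sketch(exp_seed, worker_id)
    return [random.random() for _ in range(n)]


same_worker_twice = first_draws(42, 0) == first_draws(42, 0)
two_workers_differ = first_draws(42, 0) != first_draws(42, 1)
```

With a PyTorch DataLoader, a function of this shape would typically be bound to the experiment seed (for example with functools.partial) and passed as the worker_init_fn argument.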
data.sampler
- class pydgn.data.sampler.RandomSampler(data_source: pydgn.data.dataset.DatasetInterface)
Bases:
torch.utils.data.sampler.RandomSampler
This sampler wraps the dataset and saves the random permutation applied to the samples, so that it will be available for further use (e.g. for saving graph embeddings in the original order). The permutation is saved in the ‘permutation’ attribute.
- Parameters
data_source (
pydgn.data.DatasetInterface
) – the dataset object
- data_source: Sized
- replacement: bool
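The reason for storing the permutation can be illustrated with a small pure-Python sampler. PermutationSamplerSketch is a hypothetical class that does not depend on PyTorch; it only mirrors the idea of exposing a permutation attribute that lets shuffled outputs be put back in their original order.

```python
import random


class PermutationSamplerSketch:
    """Hypothetical sampler that remembers the last permutation it produced."""

    def __init__(self, data_source, seed=0):
        self.data_source = data_source
        self.rng = random.Random(seed)
        self.permutation = None

    def __iter__(self):
        perm = list(range(len(self.data_source)))
        self.rng.shuffle(perm)
        self.permutation = perm  # saved for later use
        return iter(perm)

    def __len__(self):
        return len(self.data_source)


data = ["a", "b", "c", "d"]
sampler = PermutationSamplerSketch(data)
shuffled = [data[i] for i in sampler]

# Undo the shuffle: output position j came from original index permutation[j].
restored = [None] * len(data)
for position, original_idx in enumerate(sampler.permutation):
    restored[original_idx] = shuffled[position]
```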
data.splitter
- class pydgn.data.splitter.Fold(train_idxs, val_idxs=None, test_idxs=None)
Bases:
object
Simple class that stores training, validation, and test indices.
- Parameters
train_idxs (Union[list, tuple]) – training indices
val_idxs (Union[list, tuple]) – validation indices. Default is
None
test_idxs (Union[list, tuple]) – test indices. Default is
None
- class pydgn.data.splitter.InnerFold(train_idxs, val_idxs=None, test_idxs=None)
Bases:
pydgn.data.splitter.Fold
Simple extension of the Fold class that returns a dictionary with training and validation indices (model selection).
- todict() dict
Creates a dictionary with the training/validation indices.
- Returns
a dict with keys
['train', 'val']
associated with the respective indices
- class pydgn.data.splitter.LinkPredictionSingleGraphSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1, undirected: bool = False, avoid_opposite_negative_edges: bool = True, run_checks=False)
Bases:
pydgn.data.splitter.Splitter
Class that inherits from
Splitter
and computes link splits for link classification tasks. IMPORTANT: This class implements bootstrapping rather than k-fold cross-validation, so different outer test sets may have overlapping indices. Does not support edge attributes at the moment.
- Parameters
n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold
n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold
seed (int) – random seed for reproducibility (on the same machine)
stratify (bool) – whether to apply stratification or not (should be true for classification tasks)
shuffle (bool) – whether to apply shuffle or not
inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is
0.1
outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is
0.1
test_ratio (float) – percentage of test set for hold_out model assessment. Default is
0.1
undirected (bool) – whether or not the graph is undirected. Default is
False
avoid_opposite_negative_edges (bool) – whether or not to avoid creating negative edges that are opposite to existing edges. Default is
True
run_checks (bool) – whether or not to run correctness checks. Creates a full adjacency matrix, so it can be memory intensive. Default is
False
- split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)
Computes the splits and stores them in the list fields
self.outer_folds
and
self.inner_folds
. Links are selected at random: this means outer test folds will almost surely overlap with each other if test_ratio is 10% of the total samples. The recommended procedure here is to use the outer folds to do bootstrapping rather than k-fold cross-validation. Idea taken from: https://arxiv.org/pdf/1811.05868.pdf
- Parameters
dataset (
DatasetInterface
) – the Dataset object
targets (np.ndarray) – targets used for stratification. Default is
None
- train_val_test_edge_split(edge_index, edge_attr, val_ratio, test_ratio, num_nodes)
Sample training/validation/test edges at random.
- class pydgn.data.splitter.NoShuffleTrainTestSplit(test_ratio)
Bases:
object
Class that implements a very simple training/test split. Can be used to further split training data into training and validation.
- Parameters
test_ratio – percentage of data to use for evaluation.
- split(idxs, y=None)
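A split without shuffling can be sketched as below. Taking the test portion from the tail of the index list and rounding its size down are assumptions of this example, and no_shuffle_split is a hypothetical helper rather than the class method itself.

```python
def no_shuffle_split(idxs, test_ratio):
    """Split idxs in order: the first part trains, the tail evaluates."""
    n_samples = len(idxs)
    n_test = int(n_samples * test_ratio)  # rounded down
    n_train = n_samples - n_test
    return list(idxs[:n_train]), list(idxs[n_train:])


train_idxs, test_idxs = no_shuffle_split(list(range(10)), test_ratio=0.2)
```

Applying the same helper again to train_idxs would carve a validation set out of the training data, which is the second use mentioned in the class description.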
- class pydgn.data.splitter.OGBGSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)
Bases:
pydgn.data.splitter.Splitter
- split(dataset: pydgn.data.dataset.OGBGDatasetInterface, targets=None)
Computes the splits and stores them in the list fields
self.outer_folds
and
self.inner_folds
.
- Parameters
dataset (
DatasetInterface
) – the Dataset object
targets (np.ndarray) – targets used for stratification. Default is
None
- class pydgn.data.splitter.OuterFold(train_idxs, val_idxs=None, test_idxs=None)
Bases:
pydgn.data.splitter.Fold
Simple extension of the Fold class that returns a dictionary with training, validation, and test indices (risk assessment).
- todict() dict
Creates a dictionary with the training/validation/test indices.
- Returns
a dict with keys
['train', 'val', 'test']
associated with the respective indices
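The relationship between the fold classes can be sketched as follows. The class names carry a Sketch suffix to make clear that they are illustrative re-implementations based only on the descriptions above, not the library's code.

```python
class FoldSketch:
    """Stores training, validation, and test indices."""

    def __init__(self, train_idxs, val_idxs=None, test_idxs=None):
        self.train_idxs = train_idxs
        self.val_idxs = val_idxs
        self.test_idxs = test_idxs


class InnerFoldSketch(FoldSketch):
    def todict(self):
        # Model selection only needs training and validation indices.
        return {"train": self.train_idxs, "val": self.val_idxs}


class OuterFoldSketch(FoldSketch):
    def todict(self):
        # Risk assessment also carries the test indices.
        return {
            "train": self.train_idxs,
            "val": self.val_idxs,
            "test": self.test_idxs,
        }


inner = InnerFoldSketch([0, 1, 2], val_idxs=[3]).todict()
outer = OuterFoldSketch([0, 1, 2, 3], val_idxs=[4], test_idxs=[5, 6]).todict()
```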
- class pydgn.data.splitter.Splitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)
Bases:
object
Class that generates and stores the data splits at dataset creation time.
- Parameters
n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold
n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold
seed (int) – random seed for reproducibility (on the same machine)
stratify (bool) – whether to apply stratification or not (should be true for classification tasks)
shuffle (bool) – whether to apply shuffle or not
inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is
0.1
outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is
0.1
test_ratio (float) – percentage of test set for hold_out model assessment. Default is
0.1
- get_graph_targets(dataset: pydgn.data.dataset.DatasetInterface) -> (<class 'bool'>, <class 'numpy.ndarray'>)
Reads the entire dataset and returns the targets.
- Parameters
dataset (
DatasetInterface
) – the dataset
- Returns
a tuple of two elements. The first element is a boolean, which is
True
if target values exist or an exception has not been thrown. The second value holds the actual targets or
None
, depending on the first boolean value.
- classmethod load(path: str)
Loads the data splits from disk.
- Parameters
path (str) – the path of the yaml file with the splits
- Returns
a
Splitter
object
- save(path: str)
Saves the split as a dictionary into a
torch
file. The arguments of the dictionary are:
* seed (int)
* splitter_class (str)
* splitter_args (dict)
* outer_folds (list of dicts)
* inner_folds (list of lists of dicts)
- Parameters
path (str) – filepath where to save the object
- split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)
Computes the splits and stores them in the list fields
self.outer_folds
and
self.inner_folds
.
- Parameters
dataset (
DatasetInterface
) – the Dataset object
targets (np.ndarray) – targets used for stratification. Default is
None
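A nested split with outer folds for risk assessment and inner folds for model selection can be sketched in pure Python as below. This covers only the k-fold case (more than one fold at both levels) and ignores stratification, hold-out ratios, and the outer validation set; the helper names k_fold and nested_splits are hypothetical.

```python
import random


def k_fold(indices, k, rng):
    """Yield (train, held_out) pairs for a shuffled k-fold partition."""
    indices = list(indices)
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def nested_splits(n_samples, n_outer_folds, n_inner_folds, seed):
    """Build outer folds and, for each one, inner folds of its training set."""
    rng = random.Random(seed)
    outer_folds, inner_folds = [], []
    for outer_train, test in k_fold(range(n_samples), n_outer_folds, rng):
        outer_folds.append({"train": outer_train, "test": test})
        # Inner folds only ever see the outer training indices.
        inner_folds.append([
            {"train": inner_train, "val": val}
            for inner_train, val in k_fold(outer_train, n_inner_folds, rng)
        ])
    return outer_folds, inner_folds


outer_folds, inner_folds = nested_splits(12, n_outer_folds=3, n_inner_folds=2, seed=0)
```

The key property is that test indices of an outer fold never appear in any of its inner folds, so model selection cannot see the data used for risk assessment.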
- pydgn.data.splitter.to_lower_triangular(edge_index: torch.Tensor)
Transforms a PyTorch Geometric undirected edge index into its “lower triangular counterpart”.
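One way to read the “lower triangular counterpart” is to keep only the edges whose source index is strictly greater than the target index, so that each undirected edge appears once. The sketch below uses nested lists instead of a tensor, and the strict comparison (which also drops self-loops) is an assumption of this example.

```python
def to_lower_triangular_sketch(edge_index):
    """Keep only edges (row, col) with row > col."""
    rows, cols = edge_index
    kept = [(r, c) for r, c in zip(rows, cols) if r > c]
    return [[r for r, _ in kept], [c for _, c in kept]]


# Undirected triangle 0-1-2 stored in both directions.
edge_index = [[0, 1, 1, 2, 0, 2], [1, 0, 2, 1, 2, 0]]
lower = to_lower_triangular_sketch(edge_index)
```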
data.transform
- class pydgn.data.transform.ConstantEdgeIfEmpty(value=1)
Bases:
object
Adds a constant value to each edge feature only if edge_attr is None.
- Parameters
value (int) – The value to add. Default is
1
- class pydgn.data.transform.ConstantIfEmpty(value=1)
Bases:
object
Adds a constant value to each node feature only if x is None.
- Parameters
value (int) – The value to add. Default is
1
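The conditional behaviour of this transform can be sketched with a plain namespace object standing in for a graph. ConstantIfEmptySketch, the num_nodes attribute, and the list-of-lists feature layout are assumptions of this example.

```python
from types import SimpleNamespace


class ConstantIfEmptySketch:
    """Hypothetical transform: fill in node features only when missing."""

    def __init__(self, value=1):
        self.value = value

    def __call__(self, data):
        # Only touch graphs that have no node features at all.
        if data.x is None:
            data.x = [[self.value] for _ in range(data.num_nodes)]
        return data


graph = ConstantIfEmptySketch(value=1)(SimpleNamespace(x=None, num_nodes=3))
untouched = ConstantIfEmptySketch()(SimpleNamespace(x=[[0.5]], num_nodes=1))
```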
- class pydgn.data.transform.Degree(in_degree: bool = False, cat: bool = True)
Bases:
object
Adds the node degree to the node features.
- Parameters
in_degree (bool) – If set to
True
, will compute the in-degree of nodes instead of the out-degree. Not relevant if the graph is undirected. (default:
False
)
cat (bool) – Concat node degrees to node features instead of replacing them. (default:
True
)
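The effect of the two flags can be sketched on a small directed graph. The function add_degree is hypothetical, it works on nested lists rather than tensors, and it assumes the first row of edge_index holds source nodes and the second row target nodes.

```python
from types import SimpleNamespace


def add_degree(data, in_degree=False, cat=True):
    """Append (or substitute) the per-node degree computed from edge_index."""
    sources, targets = data.edge_index
    endpoints = targets if in_degree else sources
    degree = [0] * data.num_nodes
    for node in endpoints:
        degree[node] += 1
    if cat and data.x is not None:
        # Concatenate the degree to the existing node features.
        data.x = [feats + [d] for feats, d in zip(data.x, degree)]
    else:
        # Replace the node features with the degree alone.
        data.x = [[d] for d in degree]
    return data


# Directed graph with edges 0->1, 0->2, 1->2.
graph = SimpleNamespace(
    x=[[1.0], [2.0], [3.0]], edge_index=[[0, 0, 1], [1, 2, 2]], num_nodes=3
)
out_deg = add_degree(graph).x
```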
data.util
- pydgn.data.util.check_argument(cls: object, arg_name: str) bool
Checks whether
arg_name
is in the signature of a method or class.
- Parameters
cls (object) – the class to inspect
arg_name (str) – the name to look for
- Returns
True
if the name was found,
False
otherwise
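A check of this kind can be written with the standard inspect module. The helper has_argument and the Example class are hypothetical and show only one plausible way to perform the lookup.

```python
import inspect


def has_argument(obj, arg_name):
    """Return True if arg_name is a parameter of obj's call signature."""
    return arg_name in inspect.signature(obj).parameters


class Example:
    def __init__(self, root, name, transform=None):
        pass


found = has_argument(Example, "transform")
missing = has_argument(Example, "pre_filter")
```

A lookup like this makes it possible to pass an optional keyword argument only to the classes that actually accept it.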
- pydgn.data.util.filter_adj(edge_index: torch.Tensor, edge_attr: torch.Tensor, mask: torch.Tensor) -> (<class 'torch.Tensor'>, typing.Union[torch.Tensor, NoneType])
Adapted from https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/utils/dropout.html. Does the same thing but with a different signature
- Parameters
edge_index (torch.Tensor) – the usual PyG matrix of edge indices
edge_attr (torch.Tensor) – the usual PyG matrix of edge attributes
mask (torch.Tensor) – boolean tensor with edges to filter
- Returns
a tuple (filtered edge index, filtered edge attr or
None
if
edge_attr
is
None
)
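The filtering step can be sketched with nested lists in place of tensors. The function name is hypothetical, and treating a True entry in the mask as "keep this edge" is an assumption of this example.

```python
def filter_adj_sketch(edge_index, edge_attr, mask):
    """Keep the edges (and attributes, if any) where mask is True."""
    rows, cols = edge_index
    kept_rows = [r for r, keep in zip(rows, mask) if keep]
    kept_cols = [c for c, keep in zip(cols, mask) if keep]
    kept_attr = None
    if edge_attr is not None:
        kept_attr = [a for a, keep in zip(edge_attr, mask) if keep]
    return [kept_rows, kept_cols], kept_attr


edge_index = [[0, 1, 2], [1, 2, 0]]
mask = [True, False, True]
filtered_index, filtered_attr = filter_adj_sketch(
    edge_index, [[0.1], [0.2], [0.3]], mask
)
no_attr_index, no_attr = filter_adj_sketch(edge_index, None, mask)
```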
- pydgn.data.util.get_or_create_dir(path: str) str
Creates the directories associated with the specified path if they are missing, and returns the path string.
- Parameters
path (str) – the path
- Returns
the same path as the given argument
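The described behaviour matches what os.makedirs offers with exist_ok=True, so a minimal sketch could look like this. The function name and the temporary directory used in the example are assumptions.

```python
import os
import tempfile


def get_or_create_dir_sketch(path):
    """Create path (and missing parents) if needed, then return it."""
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()
target = os.path.join(base, "DATA", "splits")
returned = get_or_create_dir_sketch(target)
# Calling it again on an existing directory is a no-op.
returned_again = get_or_create_dir_sketch(target)
```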
- pydgn.data.util.load_dataset(data_root: str, dataset_name: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface]) pydgn.data.dataset.DatasetInterface
Loads the dataset using the
dataset_kwargs.pt
file created when parsing the data config file.
- Parameters
data_root (str) – path of the folder that contains the dataset folder
dataset_name (str) – name of the dataset (same as the name of the dataset folder that has been already created)
dataset_class (Callable[…,
DatasetInterface
]) – the class of the dataset to instantiate with the parameters stored in the
dataset_kwargs.pt
file.
- Returns
a
DatasetInterface
object
- pydgn.data.util.preprocess_data(options: dict)
One of the main functions of the PyDGN library. Used to create the dataset and its associated files that ensure the correct functioning of the data loading steps.
- Parameters
options (dict) – a dictionary of dataset/splitter arguments as defined in the data configuration file used.