grape.ensmallen.datasets.linqs

This sub-module offers methods to automatically retrieve the graphs from the LINQS repository.

"""This sub-module offers methods to automatically retrieve the graphs from the LINQS repository."""

from .pubmeddiabetes import PubMedDiabetes
from .cora import Cora
from .citeseer import CiteSeer

__all__ = [
    "PubMedDiabetes", "Cora", "CiteSeer",
]
#   def PubMedDiabetes( directed: bool = False, preprocess: bool = True, load_nodes: bool = True, verbose: int = 2, cache: bool = True, cache_path: str = 'graphs/linqs', version: str = 'latest', **additional_graph_kwargs: Dict ) -> grape.ensmallen.ensmallen.Graph:
def PubMedDiabetes(
    directed: bool = False,
    preprocess: bool = True,
    load_nodes: bool = True,
    verbose: int = 2,
    cache: bool = True,
    cache_path: str = "graphs/linqs",
    version: str = "latest",
    **additional_graph_kwargs: Dict
) -> Graph:
    """Return new instance of the PubMedDiabetes graph.

    The graph is automatically retrieved from the LINQS repository.

    The Pubmed Diabetes dataset consists of 19717 scientific publications from
    the PubMed database pertaining to diabetes, each classified into one of
    three classes. The citation network consists of 44338 links. Each
    publication in the dataset is described by a TF/IDF weighted word vector
    from a dictionary which consists of 500 unique words.

    Parameters
    -------------------
    directed: bool = False
        Whether to load the graph as directed or undirected.
        By default false.
    preprocess: bool = True
        Whether to preprocess the graph to be loaded in
        optimal time and memory.
    load_nodes: bool = True
        Whether to load the nodes vocabulary or treat the nodes
        simply as a numeric range.
    verbose: int = 2
        Whether to show loading bars during the retrieval and building
        of the graph.
    cache: bool = True
        Whether to use cache, i.e. download files only once
        and preprocess them only once.
    cache_path: str = "graphs/linqs"
        Where to store the downloaded graphs.
    version: str = "latest"
        The version of the graph to retrieve.
    additional_graph_kwargs: Dict
        Additional keyword arguments passed on when building the graph.

    Returns
    -----------------------
    Instance of PubMedDiabetes graph.

    References
    ---------------------
    Please cite the following if you use the data:

    ```bib
    @inproceedings{namata2012query,
      title={Query-driven active surveying for collective classification},
      author={Namata, Galileo and London, Ben and Getoor, Lise and Huang, Bert},
      booktitle={10th International Workshop on Mining and Learning with Graphs},
      volume={8},
      year={2012}
    }
    ```
    """
    return AutomaticallyRetrievedGraph(
        graph_name="PubMedDiabetes",
        repository="linqs",
        version=version,
        directed=directed,
        preprocess=preprocess,
        load_nodes=load_nodes,
        verbose=verbose,
        cache=cache,
        cache_path=cache_path,
        additional_graph_kwargs=additional_graph_kwargs,
        callbacks=[
            parse_linqs_pubmed_incidence_matrix
        ],
        callbacks_arguments=[
            {
                "cites_path": "Pubmed-Diabetes/Pubmed-Diabetes/data/Pubmed-Diabetes.DIRECTED.cites.tab",
                "content_path": "Pubmed-Diabetes/Pubmed-Diabetes/data/Pubmed-Diabetes.NODE.paper.tab",
                "node_path": "nodes.tsv",
                "edge_path": "edges.tsv"
            }
        ]
    )()

Return new instance of the PubMedDiabetes graph.

The graph is automatically retrieved from the LINQS repository. The Pubmed Diabetes dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.

Parameters
  • directed (bool = False): Whether to load the graph as directed or undirected. By default false.
  • preprocess (bool = True): Whether to preprocess the graph to be loaded in optimal time and memory.
  • load_nodes (bool = True): Whether to load the nodes vocabulary or treat the nodes simply as a numeric range.
  • verbose (int = 2): Whether to show loading bars during the retrieval and building of the graph.
  • cache (bool = True): Whether to use cache, i.e. download files only once and preprocess them only once.
  • cache_path (str = "graphs/linqs"): Where to store the downloaded graphs.
  • version (str = "latest"): The version of the graph to retrieve.
  • additional_graph_kwargs (Dict): Additional keyword arguments passed on when building the graph.
Returns
  • Instance of PubMedDiabetes graph.

References

Please cite the following if you use the data:

@inproceedings{namata2012query,
  title={Query-driven active surveying for collective classification},
  author={Namata, Galileo and London, Ben and Getoor, Lise and Huang, Bert},
  booktitle={10th International Workshop on Mining and Learning with Graphs},
  volume={8},
  year={2012}
}
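
A minimal usage sketch for the retrieval function above, assuming grape is installed and that the import path `grape.ensmallen.datasets.linqs` shown on this page is valid for the installed version (it may differ in other releases). The helper `load_linqs_graph` and the `LINQS_GRAPHS` tuple are hypothetical names introduced only for illustration; the three graph names and the keyword arguments are the ones documented here.

```python
from typing import Any, Tuple

# Names exported by the sub-module, as listed in its __all__.
LINQS_GRAPHS: Tuple[str, ...] = ("PubMedDiabetes", "Cora", "CiteSeer")


def load_linqs_graph(name: str = "PubMedDiabetes", **kwargs: Any):
    """Hypothetical helper: retrieve one of the LINQS graphs by name.

    Keyword arguments (directed, cache_path, version, ...) are forwarded
    unchanged to the retrieval function documented above.
    """
    if name not in LINQS_GRAPHS:
        raise ValueError(
            f"Unknown LINQS graph {name!r}; expected one of {LINQS_GRAPHS}."
        )
    # Imported lazily: this requires grape to be installed, and the first
    # retrieval needs network access to download the data.
    from grape.ensmallen.datasets import linqs

    return getattr(linqs, name)(**kwargs)


# Example call (left as a comment because it downloads the dataset):
# graph = load_linqs_graph("PubMedDiabetes", directed=False, cache_path="graphs/linqs")
```

Validating the name before importing keeps the failure mode for a misspelled graph name independent of whether the library is available.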
#   def Cora( directed: bool = False, preprocess: bool = True, load_nodes: bool = True, verbose: int = 2, cache: bool = True, cache_path: str = 'graphs/linqs', version: str = 'latest', **additional_graph_kwargs: Dict ) -> grape.ensmallen.ensmallen.Graph:
def Cora(
    directed: bool = False,
    preprocess: bool = True,
    load_nodes: bool = True,
    verbose: int = 2,
    cache: bool = True,
    cache_path: str = "graphs/linqs",
    version: str = "latest",
    **additional_graph_kwargs: Dict
) -> Graph:
    """Return new instance of the Cora graph.

    The graph is automatically retrieved from the LINQS repository.

    The Cora dataset consists of 2708 scientific publications classified into
    one of seven classes. The citation network consists of 5429 links. Each
    publication in the dataset is described by a 0/1-valued word vector indicating
    the absence/presence of the corresponding word from the dictionary. The
    dictionary consists of 1433 unique words.

    Parameters
    -------------------
    directed: bool = False
        Whether to load the graph as directed or undirected.
        By default false.
    preprocess: bool = True
        Whether to preprocess the graph to be loaded in
        optimal time and memory.
    load_nodes: bool = True
        Whether to load the nodes vocabulary or treat the nodes
        simply as a numeric range.
    verbose: int = 2
        Whether to show loading bars during the retrieval and building
        of the graph.
    cache: bool = True
        Whether to use cache, i.e. download files only once
        and preprocess them only once.
    cache_path: str = "graphs/linqs"
        Where to store the downloaded graphs.
    version: str = "latest"
        The version of the graph to retrieve.
    additional_graph_kwargs: Dict
        Additional keyword arguments passed on when building the graph.

    Returns
    -----------------------
    Instance of Cora graph.

    References
    ---------------------
    Please cite the following if you use the data:

    ```bib
    @incollection{getoor2005link,
      title={Link-based classification},
      author={Getoor, Lise},
      booktitle={Advanced methods for knowledge discovery from complex data},
      pages={189--207},
      year={2005},
      publisher={Springer}
    }

    @article{sen2008collective,
      title={Collective classification in network data},
      author={Sen, Prithviraj and Namata, Galileo and Bilgic, Mustafa and Getoor, Lise and Galligher, Brian and Eliassi-Rad, Tina},
      journal={AI magazine},
      volume={29},
      number={3},
      pages={93--93},
      year={2008}
    }
    ```
    """
    return AutomaticallyRetrievedGraph(
        graph_name="Cora",
        repository="linqs",
        version=version,
        directed=directed,
        preprocess=preprocess,
        load_nodes=load_nodes,
        verbose=verbose,
        cache=cache,
        cache_path=cache_path,
        additional_graph_kwargs=additional_graph_kwargs,
        callbacks=[
            parse_linqs_incidence_matrix
        ],
        callbacks_arguments=[
            {
                "cites_path": "cora/cora/cora.cites",
                "content_path": "cora/cora/cora.content",
                "node_path": "nodes.tsv",
                "edge_path": "edges.tsv"
            }
        ]
    )()

Return new instance of the Cora graph.

The graph is automatically retrieved from the LINQS repository. The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

Parameters
  • directed (bool = False): Whether to load the graph as directed or undirected. By default false.
  • preprocess (bool = True): Whether to preprocess the graph to be loaded in optimal time and memory.
  • load_nodes (bool = True): Whether to load the nodes vocabulary or treat the nodes simply as a numeric range.
  • verbose (int = 2): Whether to show loading bars during the retrieval and building of the graph.
  • cache (bool = True): Whether to use cache, i.e. download files only once and preprocess them only once.
  • cache_path (str = "graphs/linqs"): Where to store the downloaded graphs.
  • version (str = "latest"): The version of the graph to retrieve.
  • additional_graph_kwargs (Dict): Additional keyword arguments passed on when building the graph.
Returns
  • Instance of Cora graph.

References

Please cite the following if you use the data:

@incollection{getoor2005link,
  title={Link-based classification},
  author={Getoor, Lise},
  booktitle={Advanced methods for knowledge discovery from complex data},
  pages={189--207},
  year={2005},
  publisher={Springer}
}

@article{sen2008collective,
  title={Collective classification in network data},
  author={Sen, Prithviraj and Namata, Galileo and Bilgic, Mustafa and Getoor, Lise and Galligher, Brian and Eliassi-Rad, Tina},
  journal={AI magazine},
  volume={29},
  number={3},
  pages={93--93},
  year={2008}
}
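
The callback arguments in the Cora source above name two raw LINQS files (`cora.cites` and `cora.content`) and two output files (`nodes.tsv` and `edges.tsv`). The sketch below shows one way such a conversion could look. It is not the implementation of `parse_linqs_incidence_matrix`: it assumes the layout described in the LINQS README (each `.content` line holds a paper ID, the word attributes, then the class label; each `.cites` line holds the cited paper ID followed by the citing paper ID), it drops the word attributes entirely, and the function name and TSV column names are hypothetical.

```python
import csv
from typing import Tuple


def sketch_linqs_to_tsv(
    cites_path: str,
    content_path: str,
    node_path: str,
    edge_path: str,
) -> Tuple[int, int]:
    """Write a node list and an edge list, returning (nodes, edges) counts."""
    nodes = 0
    with open(content_path) as src, open(node_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["node_name", "node_type"])  # hypothetical header
        for line in src:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # First field is the paper ID, last field is the class label.
            writer.writerow([fields[0], fields[-1]])
            nodes += 1

    edges = 0
    with open(cites_path) as src, open(edge_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["source", "destination"])  # hypothetical header
        for line in src:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            cited, citing = fields
            # Assumed convention: the edge points from the citing paper
            # to the cited paper.
            writer.writerow([citing, cited])
            edges += 1
    return nodes, edges
```

Using the paper's class label as the node type is a choice made for this sketch; the library's own callback may organise nodes and edges differently.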
#   def CiteSeer( directed: bool = False, preprocess: bool = True, load_nodes: bool = True, verbose: int = 2, cache: bool = True, cache_path: str = 'graphs/linqs', version: str = 'latest', **additional_graph_kwargs: Dict ) -> grape.ensmallen.ensmallen.Graph:
def CiteSeer(
    directed: bool = False,
    preprocess: bool = True,
    load_nodes: bool = True,
    verbose: int = 2,
    cache: bool = True,
    cache_path: str = "graphs/linqs",
    version: str = "latest",
    **additional_graph_kwargs: Dict
) -> Graph:
    """Return new instance of the CiteSeer graph.

    The graph is automatically retrieved from the LINQS repository.

    The CiteSeer dataset consists of 3312 scientific publications classified
    into one of six classes. The citation network consists of 4732 links. Each
    publication in the dataset is described by a 0/1-valued word vector indicating
    the absence/presence of the corresponding word from the dictionary. The
    dictionary consists of 3703 unique words.

    Parameters
    -------------------
    directed: bool = False
        Whether to load the graph as directed or undirected.
        By default false.
    preprocess: bool = True
        Whether to preprocess the graph to be loaded in
        optimal time and memory.
    load_nodes: bool = True
        Whether to load the nodes vocabulary or treat the nodes
        simply as a numeric range.
    verbose: int = 2
        Whether to show loading bars during the retrieval and building
        of the graph.
    cache: bool = True
        Whether to use cache, i.e. download files only once
        and preprocess them only once.
    cache_path: str = "graphs/linqs"
        Where to store the downloaded graphs.
    version: str = "latest"
        The version of the graph to retrieve.
    additional_graph_kwargs: Dict
        Additional keyword arguments passed on when building the graph.

    Returns
    -----------------------
    Instance of CiteSeer graph.

    References
    ---------------------
    Please cite the following if you use the data:

    ```bib
    @incollection{getoor2005link,
      title={Link-based classification},
      author={Getoor, Lise},
      booktitle={Advanced methods for knowledge discovery from complex data},
      pages={189--207},
      year={2005},
      publisher={Springer}
    }

    @article{sen2008collective,
      title={Collective classification in network data},
      author={Sen, Prithviraj and Namata, Galileo and Bilgic, Mustafa and Getoor, Lise and Galligher, Brian and Eliassi-Rad, Tina},
      journal={AI magazine},
      volume={29},
      number={3},
      pages={93--93},
      year={2008}
    }
    ```
    """
    return AutomaticallyRetrievedGraph(
        graph_name="CiteSeer",
        repository="linqs",
        version=version,
        directed=directed,
        preprocess=preprocess,
        load_nodes=load_nodes,
        verbose=verbose,
        cache=cache,
        cache_path=cache_path,
        additional_graph_kwargs=additional_graph_kwargs,
        callbacks=[
            parse_linqs_incidence_matrix
        ],
        callbacks_arguments=[
            {
                "cites_path": "citeseer/citeseer/citeseer.cites",
                "content_path": "citeseer/citeseer/citeseer.content",
                "node_path": "nodes.tsv",
                "edge_path": "edges.tsv"
            }
        ]
    )()

Return new instance of the CiteSeer graph.

The graph is automatically retrieved from the LINQS repository. The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.

Parameters
  • directed (bool = False): Whether to load the graph as directed or undirected. By default false.
  • preprocess (bool = True): Whether to preprocess the graph to be loaded in optimal time and memory.
  • load_nodes (bool = True): Whether to load the nodes vocabulary or treat the nodes simply as a numeric range.
  • verbose (int = 2): Whether to show loading bars during the retrieval and building of the graph.
  • cache (bool = True): Whether to use cache, i.e. download files only once and preprocess them only once.
  • cache_path (str = "graphs/linqs"): Where to store the downloaded graphs.
  • version (str = "latest"): The version of the graph to retrieve.
  • additional_graph_kwargs (Dict): Additional keyword arguments passed on when building the graph.
Returns
  • Instance of CiteSeer graph.

References

Please cite the following if you use the data:

@incollection{getoor2005link,
  title={Link-based classification},
  author={Getoor, Lise},
  booktitle={Advanced methods for knowledge discovery from complex data},
  pages={189--207},
  year={2005},
  publisher={Springer}
}

@article{sen2008collective,
  title={Collective classification in network data},
  author={Sen, Prithviraj and Namata, Galileo and Bilgic, Mustafa and Getoor, Lise and Galligher, Brian and Eliassi-Rad, Tina},
  journal={AI magazine},
  volume={29},
  number={3},
  pages={93--93},
  year={2008}
}
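
The 0/1-valued word vectors mentioned in the CiteSeer description can be illustrated with a small sketch. The dictionary and the document below are made-up toy values, not entries from the CiteSeer data, and `binary_word_vector` is a hypothetical helper name.

```python
from typing import Iterable, List, Sequence


def binary_word_vector(dictionary: Sequence[str], words: Iterable[str]) -> List[int]:
    """Return a 0/1 vector with one entry per dictionary word.

    Entry i is 1 if dictionary[i] occurs in the document, 0 otherwise.
    """
    present = set(words)
    return [1 if word in present else 0 for word in dictionary]


# Toy dictionary of four words (the CiteSeer dictionary has 3703).
toy_dictionary = ["graph", "kernel", "neural", "query"]
toy_document = "a neural model for graph data".split()
vector = binary_word_vector(toy_dictionary, toy_document)
print(vector)  # [1, 0, 1, 0]
```

The vector length always equals the dictionary size, so every publication is described in the same feature space regardless of how many words it contains.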