Manual for panGraphViewer



Yuxuan Yuan, Ricky Ma and Ting-Fung Chan*

The Chinese University of Hong Kong, Hong Kong

panGraphViewer logo



Version 1.0

2021-08-05


Versions and dependencies

We provide two versions of the application: a desktop-based version and a web-based version.

Python 3 is required to run this software, and we recommend using miniconda3 to install all the Python 3 libraries.

After the installation of miniconda3, you can follow the steps below to run panGraphViewer.

Desktop-based panGraphViewer

Library installation for the desktop-based version

Steps on different systems

After the steps above, you can install the python3 libraries by typing:

If you use pip, you can install the Python 3 libraries with:

Alternatively, move to the panGraphViewerApp directory first and then install with pip using:

Note:

  1. On Linux or macOS systems, pysam is needed. You may install this package using

  2. On Windows platforms, pysam is not available, so we use a Windows version of the samtools package instead. The additional libraries below are needed and can be installed using

Start the desktop-based version

  1. On Linux or macOS systems, you may use the command line below in Terminal to open the software.

  2. On Windows systems, search for and open Anaconda Prompt (miniconda3) first, then move to the panGraphViewer directory. For example, if you have put panGraphViewer on your Desktop and the opened Anaconda Prompt (miniconda3) is on your C drive, you may use the command line below to start the program:

    If you have put panGraphViewer on another drive, you may need to move to the target drive first. For instance, if the target drive is D, you can move to it by typing D: in Anaconda Prompt (miniconda3) and then move to the panGraphViewer directory to execute panGraphViewerApp.py.

    Please NOTE that on Windows systems, you need to use the backslash \ rather than the forward slash / to move to the target directory.

  3. The logging information will be shown in Anaconda Prompt (miniconda3) or Terminal, depending on the system you use. This is useful for monitoring the status of the application.


Web-based panGraphViewer

To meet different requirements, we have also created a web-based panGraphViewer. Most functions provided in the desktop-based version have been implemented in the web browser-based version. Users can install this version locally or deploy it online. The web browser-based version offers administrative functions to help create accounts for different users.

Library installation for the web-based version

Depending on the systems used, users can use pip directly to install the needed python3 libraries after moving to the panGraphViewerWeb directory.

As mentioned for the desktop-based version, pysam cannot be installed on Windows systems, so Windows users need to install alternatives by using

For Linux or macOS users, pysam can be installed directly using

Start the web-based version

After the installation above, users can move to the panGraphViewerWeb directory by referring to the steps mentioned in the desktop version through Terminal or Anaconda Prompt (miniconda3).

Note that the folder needed here is panGraphViewerWeb.

Once in the panGraphViewerWeb directory, users can start the application by typing

or users can use the command below to start the web browser-based version

Once the message Starting development server at http://localhost:8004/ or similar information is shown, users can open that address in a browser to use the web-based panGraphViewer.

The admin page is http://localhost:8004/admin and the initial admin info is:

Note: to move back, please use the back button provided by the web browser rather than directly clicking the corresponding functions in the web page to perform analyses.

The files needed in the application

The rGFA file

  1. If you have multiple high-quality genome assemblies from different individuals, you may use minigraph (Linux preferred) to generate a reference GFA (rGFA) file.

    Before running minigraph, the header of the FASTA file needs to be modified. For example, if you have a FASTA file from Sample1 with a header like:

    You may modify the header to:

    On Linux, the command lines that can be used to achieve this are:

    We also provide a Python script, renameFastaHeader.py, to help with this conversion. The script can be found in the scripts folder under panGraphViewer --> panGraphViewerApp. Alternatively, users can perform the conversion in the UI by clicking Tools --> Format Conversion --> Modify FASTA Header.

    Please NOTE that:

    I). If you do not modify the header of your FASTA file and directly use minigraph to generate the rGFA file, panGraphViewer can still read the file, but many features, such as where each node comes from, will not be shown in detail. A warning message will be displayed in both the UI and the opened Terminal or PowerShell.

    II). For the sample name, please DO NOT include ||.

  2. If you have a GFA file rather than an rGFA file, you may try to follow the rGFA standard to convert your GFA file into an rGFA file. After generating an rGFA file, you can use this software to visualise the graph of interest.
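The header modification described in step 1 can be sketched in Python as follows. This is a minimal illustration, not the bundled renameFastaHeader.py: the function names are hypothetical, and the `Sample||header` layout (sample name, then `||`, then the original header) is an assumption inferred from the note that sample names must not contain `||`.

```python
from typing import Iterable, Iterator

# Assumed separator between the sample name and the original header text.
DELIMITER = "||"


def rename_fasta_headers(lines: Iterable[str], sample: str,
                         delimiter: str = DELIMITER) -> Iterator[str]:
    """Yield FASTA lines with the sample name prefixed to every header."""
    if delimiter in sample:
        raise ValueError(f"sample name must not contain {delimiter!r}")
    for line in lines:
        if line.startswith(">"):
            # Keep the original header text and only prepend the sample name.
            yield f">{sample}{delimiter}{line[1:]}"
        else:
            yield line


def rename_fasta_file(src: str, dst: str, sample: str) -> None:
    """Stream src to dst so a large assembly is never held in memory."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(rename_fasta_headers(fin, sample))
```

For example, `rename_fasta_file("Sample1.fasta", "Sample1.renamed.fasta", "Sample1")` would write a copy of a hypothetical Sample1.fasta with every header prefixed; sequence lines pass through unchanged.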

The VCF file

We also accept a VCF file to show the graph. A reference FASTA file is optional if the VCF follows the standard format. The program will automatically check the input VCF file and evaluate whether it meets the requirements. If not, a message will be shown.

VCF filtration is highly recommended before plotting the graph.
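As one possible filtration step, the sketch below keeps only records whose FILTER column is PASS or unset ("."). The manual does not prescribe a filtering criterion, so this choice and the function name are assumptions made for illustration; dedicated tools offer far richer filters.

```python
from typing import Iterable, Iterator


def keep_passing_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and records whose FILTER column is PASS or '.'."""
    for line in lines:
        if line.startswith("#"):
            # Meta-information lines and the column header are kept as-is.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) of a VCF record is FILTER.
        if len(fields) > 6 and fields[6] in {"PASS", "."}:
            yield line
```

Because the function works on any iterable of lines, it can be applied to an open file handle and the result written out with `writelines`.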

We also provide a method to help convert a VCF file to an rGFA file. Users can perform the conversion directly through the interface provided in the application or directly use vcf2rGFA.py under the panGraphViewer --> panGraphViewerApp --> scripts folder.

Note: If there are many variations in the VCF file, we recommend using vcf2rGFA.py directly to convert chromosome by chromosome rather than converting the whole file at once. This will save a lot of computing resources when plotting graphs.
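One way to prepare per-chromosome input, independent of whatever options vcf2rGFA.py itself offers, is to split the VCF beforehand. The sketch below is an assumption-laden illustration with a hypothetical function name; it groups records in memory, so a very large VCF would be better streamed to one file per chromosome.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_vcf_by_chromosome(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group VCF records by CHROM, repeating the header in every group."""
    header: List[str] = []
    records: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            # CHROM is the first tab-separated column of a record.
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return {chrom: header + recs for chrom, recs in records.items()}
```

Each value in the returned dictionary is a complete small VCF (header plus the records of one chromosome) that can be written to its own file and converted separately.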

The usage of vcf2rGFA.py is shown below. Both Windows and Linux/macOS users can directly use this script to convert a VCF file to an rGFA file.

The BED file

The BED file should contain the annotation information from the backbone sample. There should be at least 6 columns in the BED file.

Column    Information
1         Chromosome ID
2         Gene start position
3         Gene end position
4         Gene ID
5         Score (or others; the program does not use the info in this column)
6         Orientation

Users can load the BED file to check the overlaps between variations and genes. By default, genes overlapping with more than 2 nodes will be shown in the dropdown menu. A gene list will be saved in the output directory after parsing the BED file.
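The BED layout and the overlap check can be illustrated with the sketch below. It is not the application's own parser: the names are hypothetical, node intervals are assumed to be (chromosome, start, end) tuples using the same half-open coordinates as BED, and the node-count threshold is left as a parameter.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Gene(NamedTuple):
    chrom: str
    start: int
    end: int
    gene_id: str
    strand: str


def parse_bed(lines: Iterable[str]) -> List[Gene]:
    """Parse BED lines with at least 6 tab-separated columns."""
    genes = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            raise ValueError(f"line {number}: expected at least 6 columns, got {len(cols)}")
        # Column 5 (score) is skipped because the program does not use it.
        genes.append(Gene(cols[0], int(cols[1]), int(cols[2]), cols[3], cols[5]))
    return genes


def genes_overlapping_nodes(genes: Iterable[Gene],
                            nodes: Iterable[Tuple[str, int, int]],
                            min_nodes: int = 2) -> List[str]:
    """Return IDs of genes that overlap at least min_nodes node intervals."""
    node_list = list(nodes)
    hits = []
    for gene in genes:
        count = sum(
            1 for chrom, start, end in node_list
            if chrom == gene.chrom and start < gene.end and gene.start < end
        )
        if count >= min_nodes:
            hits.append(gene.gene_id)
    return hits
```

The quadratic scan keeps the example short; for genome-scale annotations, sorting the intervals or using an interval tree would be the usual choice.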


Q&A:

The minimum computing resource needed

The minimum computing resource needed for running the application


Which application should I use

The desktop-based application is optimized for Windows 10 and macOS Big Sur, and has also been tested on Ubuntu 18.04.5. On Linux versions older than Ubuntu 18.04.5 or equivalent, such as Ubuntu 16.04, PyQtWebEngine may not work properly. On other operating system versions, the desktop-based application may still work, but its layout may differ.

For the web browser-based version, we suggest running it in a Linux or macOS environment. If users want to run it on Windows systems, Windows 10 or above is recommended. Users can also use Docker to run the web browser-based version. However, WSL is needed to run the Docker version on Windows 10 or above.


The backbone sample

The backbone sample is the one used as the main sequence provider to produce the pangenome graph or the reference sample to produce the VCF file. In the pangenome graph, most of the nodes are from the backbone sample (shared by all) with some nodes (variations) from other samples.


The colours shown in the graph

Each sample uses one particular colour, and the most frequent colour should be the one used for the backbone sample. The colours are randomly selected by the program from a designed colour palette.


The types of graphs

We provide two kinds of graph plots in the program to achieve good performance and visualisation. By default, if the number of checked nodes is <= 200, a vis.js based graph will be shown. Otherwise, a cytoscape.js based graph will be shown. Users can change these settings in the desktop-based application.
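The default selection rule can be written as a small function. This is only a sketch of the rule as stated in this manual; the function and constant names are hypothetical and do not come from the application's source code.

```python
# Default threshold stated in the manual: up to 200 nodes use vis.js.
VIS_JS_MAX_NODES = 200


def choose_renderer(node_count: int, max_vis_nodes: int = VIS_JS_MAX_NODES) -> str:
    """Return the plotting library to use for a given number of nodes."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return "vis.js" if node_count <= max_vis_nodes else "cytoscape.js"
```

Passing a different `max_vis_nodes` mirrors changing the threshold in the settings.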


The shapes shown in the graph

If you use a VCF file to show graphs, we use different node shapes to represent different kinds of variants. For instance, in the default settings for the vis.js based graph, dot represents SNP, triangle represents deletion, triangleDown represents insertion, database represents duplication, text represents inversion and star represents translocation. Users can change the corresponding settings in the desktop-based application to select preferred node shapes for different variations.
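The default vis.js shapes listed above amount to a lookup table, sketched below. The shape names are taken from this manual; the variant keys DEL and TRA, the fallback shape and the function name are assumptions made for illustration.

```python
from typing import Dict, Optional

# Shape names come from the manual; the DEL and TRA keys are assumed labels.
DEFAULT_VIS_SHAPES: Dict[str, str] = {
    "SNP": "dot",
    "DEL": "triangle",
    "INS": "triangleDown",
    "DUP": "database",
    "INV": "text",
    "TRA": "star",
}


def shape_for(variant_type: str,
              overrides: Optional[Dict[str, str]] = None,
              fallback: str = "dot") -> str:
    """Look up the node shape for a variant type, honouring user overrides."""
    shapes = {**DEFAULT_VIS_SHAPES, **(overrides or {})}
    return shapes.get(variant_type.upper(), fallback)
```

The `overrides` argument stands in for user-changed settings and expects upper-case keys, for example `shape_for("DEL", {"DEL": "square"})`.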


How to use the program

For the desktop-based version, once the application is open as shown below, users can use the following steps to explore the program.

application

For the web-based version, the login interface is like:

application-web

The program reads rGFA, VCF and BED files.

  1. If an rGFA file is available, users can browse the system to import the rGFA file directly.

  2. If an rGFA file is not available but a VCF file is available, users can import the VCF directly.

    Please NOTE that the default setting for Threads is 4. Users can change 4 to any integer >= 1 depending on the threads that the system can provide. The name of the backbone and the backbone FASTA are optional. If they are not given, the program will automatically check and assign a name (backbone by default).

VCF

  1. After importing the file(s) and specifying an output directory, users can click the Start button. The program will run internally with 'Parsing...' or 'Converting...' showing in the Status bar. Once this is completed, 'Finished in xxx s!' will be shown.

  2. Now users can select the name of the backbone sample and the ID of the chromosome they want to check. Given a coordinate (start and end positions), a graph of that region will be ready to show. The coordinate is optional. If both the start and end positions are missing, the graph of the whole selected chromosome will be shown. If only one of the two positions is missing, the program will handle this automatically.

  3. Users can change the shape of nodes and modify the display of graph by changing the corresponding Settings on the top left panel of the GUI.

  4. In the Sample(s) showing panel, users can remove or add particular samples which will be shown or hidden in the graph. The backbone sample cannot be removed.

  5. Once all settings are completed, users can click the Plot button to check the graph. Similarly, the running and completed information will be shown in the Status bar under the Plot the Graph panel. The graph will be shown in the canvas and can be zoomed in and out. Moving the mouse over a node shows the information of that node.

    mouseover

  6. If a cytoscape.js graph is shown, users can press CTRL or Command and hold the left mouse button to select particular node regions, then right-click to show or hide nodes.

    mSelection

  7. Users can check the sequences of particular nodes by selecting them through the node combobox or by typing them directly into the textbox. Please NOTE that each line can contain only one node ID.

  8. When checking the overlap between genes and variation regions, users can import a BED file in the Check Overlap with Genes panel. After parsing the BED file, genes overlapping with at least 2 nodes will be listed in the dropdown menu. Users can select a gene of interest to check overlaps. A graph will be shown in the canvas. Users can enable the zoom-in and zoom-out function by clicking the Wheel Zoom (x-axis) button on the top right panel of the canvas.

    gene

  9. Users can explore other settings to get a preferred graph. A screenshot function is also provided. On Windows systems, users can press ALT+P to start screen clipping. Hold the left mouse button to select the region, then double-click the left button to save the image.


Different variations

If users use a VCF file to generate a graph genome, moving the mouse over a graph node will automatically show the variation type, such as SNP (single nucleotide polymorphism), INS (insertion), INV (inversion) or DUP (duplication). The corresponding nodes from the backbone sample will also be linked and shown.

variant

Enjoy using panGraphViewer!