Module pandas_profiling
Main module of pandas-profiling.
Pandas Profiling
Generates profile reports from a pandas DataFrame.
The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
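To make the quantile and descriptive statistics listed above concrete, here is a minimal sketch that computes several of them for one numeric column using only the Python standard library (Python 3.8+). The function name `summarize` and the choice of the "inclusive" quantile method are assumptions made for illustration; this is not how pandas_profiling itself computes its report.

```python
import statistics


def summarize(values):
    """Compute a few of the listed statistics for one numeric column."""
    data = sorted(values)
    # Quartiles; the interpolation method is an assumption of this sketch.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)  # sample standard deviation
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        "mean": mean,
        "std": stdev,
        # Median absolute deviation around the median.
        "mad": statistics.median(abs(x - median) for x in data),
        # Coefficient of variation; undefined when the mean is zero.
        "cv": stdev / mean if mean else float("nan"),
    }


summary = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

Other quantile conventions give slightly different quartiles on small samples, so values from this sketch need not match the report exactly.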
Examples
The following examples can give you an impression of what the package can do:
Installation
Using pip
You can install using the pip package manager by running
pip install pandas-profiling
Using conda
You can install using the conda package manager by running
conda install -c anaconda pandas-profiling
From source
Download the source code by cloning the repository or by pressing 'Download ZIP' on this page. Install by navigating to the proper directory and running
python setup.py install
Usage
The profile report is written in HTML5 and CSS3, which means pandas-profiling requires a modern browser.
Documentation
The documentation for pandas_profiling can be found here.
The documentation is generated using pdoc3.
If you are contributing to this project, you can rebuild the documentation using:
make docs
or on Windows:
make.bat docs
Jupyter Notebook
We recommend generating reports interactively by using the Jupyter notebook.
Start by loading in your pandas DataFrame, e.g. by using
import numpy as np
import pandas as pd
import pandas_profiling
df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=['a', 'b', 'c', 'd', 'e']
)
To display the report in a Jupyter notebook, run:
df.profile_report()
To retrieve the list of variables which are rejected due to high correlation:
profile = df.profile_report()
rejected_variables = profile.get_rejected_variables(threshold=0.9)
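The filtering behind get_rejected_variables() can be tried without generating a report. The sketch below mirrors the logic in the source listing further down this page, applied to a plain dictionary; the function name `rejected_variables` and the example data are hypothetical and only imitate the shape of the per-variable summaries.

```python
def rejected_variables(variables: dict, threshold: float = 0.9) -> list:
    """Return names whose recorded correlation is strictly above the threshold."""
    return [
        name
        for name, stats in variables.items()
        if "correlation" in stats and stats["correlation"] > threshold
    ]


# Hypothetical per-variable summaries; only some carry a "correlation" entry.
example = {
    "a": {"mean": 0.5},
    "b": {"correlation": 0.97},
    "c": {"correlation": 0.90},
}

rejected = rejected_variables(example)
rejected_loose = rejected_variables(example, threshold=0.8)
```

Note that the comparison is strict, so a variable whose correlation equals the threshold is not rejected.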
If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="output.html")
Command line usage
For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable. Run
pandas_profiling -h
for information about options and arguments.
Advanced usage
A set of options is available in order to adapt the report generated.
- title (str): Title for the report ('Pandas Profiling Report' by default).
- pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
- minify_html (boolean): Whether to minify the output HTML.
More settings can be found in the default configuration file.
Example
profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file(output_file="output.html")
Dependencies
- python (>= 3.5)
- pandas (>=0.19)
- matplotlib (>=1.4)
- missingno
- confuse
- requests
- jinja2
- numpy
- htmlmin (optional)
- phik (optional)
For development and testing we use additional packages which you can find in the requirements-dev.txt and requirements-test.txt.
Source code
"""Main module of pandas-profiling.
.. include:: ../README.md
"""
import sys
import warnings
from pandas_profiling.utils.dataframe import clean_column_names, rename_index
__version__ = "2.0.1"
from pathlib import Path
import numpy as np
from pandas_profiling.config import config
from pandas_profiling.controller import pandas_decorator
import pandas_profiling.view.templates as templates
from pandas_profiling.model.describe import describe as describe_df
from pandas_profiling.utils.paths import get_config_default
from pandas_profiling.view.report import to_html
class ProfileReport(object):
    """Generate a profile report from a Dataset stored as a pandas `DataFrame`.

    Used as is, it will output its content as an HTML report in a Jupyter notebook.
    """

    html = ""
    """the HTML representation of the report, without the wrapper (containing `<head>` etc.)"""

    def __init__(self, df, **kwargs):
        config.set_kwargs(kwargs)

        # Rename reserved column names
        df = rename_index(df)

        # Remove spaces and colons from column names
        df = clean_column_names(df)

        # Sort column names
        sort = config["sort"].get(str)
        if sys.version_info[1] <= 5 and sort != "None":
            warnings.warn("Sorting is supported from Python 3.6+")

        if sort in ["asc", "ascending"]:
            df = df.reindex(sorted(df.columns, key=lambda s: s.casefold()), axis=1)
        elif sort in ["desc", "descending"]:
            df = df.reindex(
                reversed(sorted(df.columns, key=lambda s: s.casefold())), axis=1
            )
        elif sort != "None":
            raise ValueError('"sort" should be "ascending", "descending" or None.')

        # Store column order
        config["column_order"] = df.columns.tolist()

        # Get dataset statistics
        description_set = describe_df(df)

        # Get sample
        sample = {}
        n_head = config["samples"]["head"].get(int)
        if n_head > 0:
            sample["head"] = df.head(n=n_head)

        n_tail = config["samples"]["tail"].get(int)
        if n_tail > 0:
            sample["tail"] = df.tail(n=n_tail)

        # Render HTML
        self.html = to_html(sample, description_set)
        self.minify_html = config["minify_html"].get(bool)
        self.use_local_assets = config["use_local_assets"].get(bool)
        self.title = config["title"].get(str)
        self.description_set = description_set
        self.sample = sample

    def get_description(self) -> dict:
        """Return the description (a raw statistical summary) of the dataset.

        Returns:
            Dict containing a description for each variable in the DataFrame.
        """
        return self.description_set

    def get_rejected_variables(self, threshold: float = 0.9) -> list:
        """Return a list of variable names being rejected for high
        correlation with one of the remaining variables.

        Args:
            threshold: variables whose correlation value is above this threshold are rejected (Default value = 0.9)

        Returns:
            A list of rejected variables.
        """
        variable_profile = self.description_set["variables"]
        result = []
        for col, values in variable_profile.items():
            if "correlation" in values:
                if values["correlation"] > threshold:
                    result.append(col)
        return result

    def to_file(self, output_file: Path or str) -> None:
        """Write the report to a file.

        By default a name is generated.

        Args:
            output_file: The name or the path of the file to generate including the extension (.html).
        """
        if type(output_file) == str:
            output_file = Path(output_file)

        with output_file.open("w", encoding="utf8") as f:
            wrapped_html = self.to_html()
            if self.minify_html:
                from htmlmin.main import minify

                wrapped_html = minify(
                    wrapped_html, remove_all_empty_space=True, remove_comments=True
                )
            f.write(wrapped_html)

    def to_html(self) -> str:
        """Generate and return the complete report as a single string,
        for use with frameworks.

        Returns:
            Profiling report html including wrapper.
        """
        return templates.template("wrapper.html").render(
            content=self.html,
            title=self.title,
            correlation=len(self.description_set["correlations"]) > 0,
            missing=len(self.description_set["missing"]) > 0,
            sample=len(self.sample) > 0,
            version=__version__,
            offline=self.use_local_assets,
        )

    def get_unique_file_name(self):
        """Generate a unique file name."""
        return (
            "profile_"
            + str(np.random.randint(1000000000, 9999999999, dtype=np.int64))
            + ".html"
        )

    def _repr_html_(self):
        """Used to output the HTML representation to a Jupyter notebook. This function creates a temporary HTML file
        in `./ipynb_tmp/profile_[hash].html` and displays an IFrame pointing to its contents.

        Notes:
            This construction solves problems with conflicting stylesheets and navigation links.
        """
        tmp_file = Path("./ipynb_tmp") / self.get_unique_file_name()
        tmp_file.parent.mkdir(exist_ok=True)
        self.to_file(tmp_file)
        from IPython.lib.display import IFrame
        from IPython.core.display import display

        display(
            IFrame(
                str(tmp_file),
                width=config["notebook"]["iframe"]["width"].get(str),
                height=config["notebook"]["iframe"]["height"].get(str),
            )
        )

    def __repr__(self):
        """Override so that Jupyter Notebook does not print the object."""
        return ""
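The column-sorting branch in `__init__` can be exercised in isolation. The following sketch applies the same case-insensitive ordering to a plain list of names instead of a DataFrame; the helper name `sort_column_names` is hypothetical and not part of the package.

```python
def sort_column_names(names, sort="None"):
    """Order column names the way the "sort" option describes."""
    if sort in ("asc", "ascending"):
        return sorted(names, key=str.casefold)
    if sort in ("desc", "descending"):
        # Mirrors the source: sort ascending first, then reverse the result.
        return list(reversed(sorted(names, key=str.casefold)))
    if sort != "None":
        raise ValueError('"sort" should be "ascending", "descending" or None.')
    # "None" (the string) keeps the original column order.
    return list(names)


ascending = sort_column_names(["b", "A", "c"], "asc")
descending = sort_column_names(["b", "A", "c"], "descending")
unchanged = sort_column_names(["b", "A", "c"])
```

Using `casefold` as the key means "A" sorts before "b" regardless of case, which a plain `sorted(names)` would not guarantee.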
Sub-modules
pandas_profiling.config
-
Configuration for the package is handled in this wrapper for confuse.
pandas_profiling.controller
-
The controller module handles all user interaction with the package (console, jupyter, etc.).
pandas_profiling.model
-
The model module handles all logic/calculations, e.g. calculate statistics, testing for special conditions.
pandas_profiling.utils
-
Utility functions for the complete package.
pandas_profiling.view
-
All functionality concerned with presentation to the user.
Classes
class ProfileReport (df, **kwargs)
-
Generate a profile report from a Dataset stored as a pandas DataFrame.
Used as is, it will output its content as an HTML report in a Jupyter notebook.
Class variables
var html
-
the HTML representation of the report, without the wrapper (containing <head> etc.)
Methods
def get_description(self)
-
Return the description (a raw statistical summary) of the dataset.
Returns
Dict containing a description for each variable in the DataFrame.
Source code
def get_description(self) -> dict:
    """Return the description (a raw statistical summary) of the dataset.

    Returns:
        Dict containing a description for each variable in the DataFrame.
    """
    return self.description_set
def get_rejected_variables(self, threshold=0.9)
-
Return a list of variable names being rejected for high correlation with one of the remaining variables.
Args
threshold
- variables whose correlation value is above this threshold are rejected (Default value = 0.9)
Returns
A list of rejected variables.
Source code
def get_rejected_variables(self, threshold: float = 0.9) -> list:
    """Return a list of variable names being rejected for high
    correlation with one of the remaining variables.

    Args:
        threshold: variables whose correlation value is above this threshold are rejected (Default value = 0.9)

    Returns:
        A list of rejected variables.
    """
    variable_profile = self.description_set["variables"]
    result = []
    for col, values in variable_profile.items():
        if "correlation" in values:
            if values["correlation"] > threshold:
                result.append(col)
    return result
def get_unique_file_name(self)
-
Generate a unique file name.
Source code
def get_unique_file_name(self):
    """Generate a unique file name."""
    return (
        "profile_"
        + str(np.random.randint(1000000000, 9999999999, dtype=np.int64))
        + ".html"
    )
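The naming scheme used by get_unique_file_name() can be reproduced with the standard library alone. This is a sketch under the assumption that only the format `profile_<10 digits>.html` matters; the function name `unique_file_name` is hypothetical, and `random.randrange` stands in for the NumPy call in the source.

```python
import random


def unique_file_name(prefix="profile_", suffix=".html"):
    """Build a file name around a random 10-digit number."""
    # randrange excludes the upper bound, like the NumPy call in the source.
    number = random.randrange(1_000_000_000, 9_999_999_999)
    return f"{prefix}{number}{suffix}"


name = unique_file_name()
```

The name is random rather than guaranteed unique: a collision is unlikely for a handful of reports but is not ruled out.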
def to_file(self, output_file)
-
Write the report to a file.
By default a name is generated.
Args
output_file
- The name or the path of the file to generate including the extension (.html).
Source code
def to_file(self, output_file: Path or str) -> None:
    """Write the report to a file.

    By default a name is generated.

    Args:
        output_file: The name or the path of the file to generate including the extension (.html).
    """
    if type(output_file) == str:
        output_file = Path(output_file)

    with output_file.open("w", encoding="utf8") as f:
        wrapped_html = self.to_html()
        if self.minify_html:
            from htmlmin.main import minify

            wrapped_html = minify(
                wrapped_html, remove_all_empty_space=True, remove_comments=True
            )
        f.write(wrapped_html)
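As the listing shows, to_file() accepts either a string or a Path and writes UTF-8 text. The sketch below shows the same pattern as a standalone helper; `write_html` is a hypothetical name, and creating missing parent directories is an addition of this sketch, not something the source does.

```python
import tempfile
from pathlib import Path
from typing import Union


def write_html(html: str, output_file: Union[str, Path]) -> Path:
    """Write an HTML string to a file given as str or Path."""
    path = Path(output_file)  # Path() accepts both str and Path
    # Assumption of this sketch: create parent directories if needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf8")
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    target = write_html("<p>report</p>", Path(tmp_dir) / "reports" / "output.html")
    written = target.read_text(encoding="utf8")
```

The annotation `Path or str` in the source evaluates to just `Path` at runtime; `Union[str, Path]` expresses the intended "either type" meaning.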
def to_html(self)
-
Generate and return the complete report as a single string, for use with frameworks.
Returns
Profiling report html including wrapper.
Source code
def to_html(self) -> str:
    """Generate and return the complete report as a single string,
    for use with frameworks.

    Returns:
        Profiling report html including wrapper.
    """
    return templates.template("wrapper.html").render(
        content=self.html,
        title=self.title,
        correlation=len(self.description_set["correlations"]) > 0,
        missing=len(self.description_set["missing"]) > 0,
        sample=len(self.sample) > 0,
        version=__version__,
        offline=self.use_local_assets,
    )
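The wrapping step in to_html() can be illustrated with a rough standard-library stand-in. The package renders a "wrapper.html" template; the `WRAPPER` template, `wrap_report` and `section_flags` below are hypothetical and only imitate the idea of placing the report body inside a page shell and deriving the section flags from the description set.

```python
import html
from string import Template

# Hypothetical stand-in for the package's "wrapper.html" template.
WRAPPER = Template(
    "<!doctype html><html><head><title>$title</title></head>"
    "<body>$content</body></html>"
)


def wrap_report(content: str, title: str) -> str:
    """Place already-rendered report HTML inside a minimal page shell."""
    return WRAPPER.substitute(title=html.escape(title), content=content)


def section_flags(description_set: dict, sample: dict) -> dict:
    """Decide which optional sections have content, as the source does."""
    return {
        "correlation": len(description_set["correlations"]) > 0,
        "missing": len(description_set["missing"]) > 0,
        "sample": len(sample) > 0,
    }


page = wrap_report("<p>hi</p>", "A & B")
flags = section_flags({"correlations": {}, "missing": {"bar": "..."}}, {})
```

The returned string can be handed to whatever serves the page; the title is escaped here because it is plain text, while the content is assumed to be trusted HTML.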