API References
Find the best fit to your empirical data among 89 theoretical distributions, scored by the Residual Sum of Squares (RSS).
- class distfit.distfit.BinomPMF(n)
Wrapper so that integer parameters don’t occur as function arguments.
References
Some parts of the binomial fitting are authored by Han-Kwang Nienhuys (2020); copying: CC-BY-SA.
- class distfit.distfit.distfit(method='parametric', alpha=0.05, multtest='fdr_bh', bins=50, bound='both', distr='popular', stats='RSS', smooth=None, n_perm=10000, todf=False, weighted=True, f=1.5)
Probability density function fitting across 89 univariate distributions to non-censored data by scoring statistics such as residual sum of squares (RSS), making plots, and hypothesis testing.
- Parameters
method (str, default: 'parametric') – Specify the method type: ‘parametric’, ‘quantile’, ‘percentile’, ‘discrete’.
alpha (float, default: 0.05) – Significance alpha.
multtest (str, default: 'fdr_bh') – Multiple test correction method: None, ‘bonferroni’, ‘sidak’, ‘holm-sidak’, ‘holm’, ‘simes-hochberg’, ‘hommel’, ‘fdr_bh’, ‘fdr_by’, ‘fdr_tsbh’, ‘fdr_tsbky’.
bins (int, default: 50) – Number of bins used to build the empirical histogram.
bound (str, default: 'both') – Set the directionality to test for significance. Upperbounds = ‘up’, ‘high’ or ‘right’, whereas lowerbounds = ‘down’, ‘low’ or ‘left’
distr (str, default: 'popular') – The distribution or set of distributions to test. Use ‘popular’ or ‘full’ for a predefined set, name a single theoretical distribution such as ‘norm’ or ‘t’, or pass a list such as [‘norm’, ‘t’, ...]. If method=‘discrete’, the binomial distribution is used. See the docs for more information about ‘popular’ and ‘full’: https://erdogant.github.io/distfit/pages/html/Parametric.html
smooth (int, default: None) – Smoothing the histogram can help to get a better fit when only a few samples are available.
stats (str, default: 'RSS') – Specify the scoring statistic: ‘RSS’, ‘wasserstein’, ‘ks’, ‘energy’. ‘ks’ stands for the Kolmogorov-Smirnov statistic.
n_perm (int, default: 10000) – Number of permutations used to model the null distribution when method is ‘quantile’.
weighted (Bool, (default: True)) – Only used in discrete fitting. In principle, the best fit is obtained with weighted=True. However, when using a different measure as the metric, such as the minimum sum of squared errors (SSE), you can set weighted=False.
f (float, (default: 1.5)) – Only used in discrete fitting. It uses n in range n0/f to n0*f where n0 is the initial estimate.
- Returns
object.
method (str) – Specified method for fitting and predicting.
alpha (float) – Specified cut-off for P-value significance.
bins (int) – Number of bins specified to create histogram.
bound (str) – Specified testing directionality of the distribution.
distr (str or list of strings) – Specified distribution or a set of distributions.
multtest (str) – Specified multiple test correction method.
todf (Bool (default: False)) – Output results in pandas dataframe when True. Note that creating pandas dataframes makes the code run significantly slower!
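The multtest default, 'fdr_bh', names the Benjamini-Hochberg false discovery rate procedure. The sketch below shows that adjustment using only the standard library. The helper name fdr_bh is hypothetical, and this is an illustration of the procedure, not distfit's internal code.

```python
def fdr_bh(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]
padj = fdr_bh(pvals)
# Flag the values that remain significant at alpha = 0.05 after correction.
significant = [p <= 0.05 for p in padj]
```

Raw P-values of 0.04 and 0.03 would pass an uncorrected 0.05 threshold, but after the adjustment only the smallest P-value remains significant.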
Example
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Set parameters
>>> # Default method is set to parametric models
>>> dist = distfit()
>>> # In case of quantile
>>> dist = distfit(method='quantile')
>>> # In case of percentile
>>> dist = distfit(method='percentile')
>>> # Fit using method
>>> model_results = dist.fit_transform(X)
>>> dist.plot()
>>>
>>> # Make prediction
>>> results = dist.predict(y)
>>> dist.plot()
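The RSS scoring named in the class description can be illustrated without the library. The sketch below bins the data into a density histogram and sums the squared differences between the bin heights and a candidate PDF evaluated at the bin centers. The helper names histogram_density and rss are hypothetical, and only a normal candidate is scored here; it is a simplified picture of the idea, not distfit's implementation.

```python
import random
from statistics import NormalDist, fmean, stdev


def histogram_density(data, bins):
    """Return bin centers and density-normalised bin heights."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Clamp the maximum value into the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(data) * width) for c in counts]
    return centers, density


def rss(dist, centers, density):
    """Residual sum of squares between a candidate PDF and the histogram."""
    return sum((d - dist.pdf(c)) ** 2 for c, d in zip(centers, density))


rng = random.Random(0)
X = [rng.gauss(0, 2) for _ in range(1000)]
centers, density = histogram_density(X, bins=50)

fitted = NormalDist(fmean(X), stdev(X))        # parameters estimated from X
shifted = NormalDist(fmean(X) + 3, stdev(X))   # deliberately poor candidate

rss_fitted = rss(fitted, centers, density)
rss_shifted = rss(shifted, centers, density)
```

The candidate with the lower RSS follows the histogram more closely, so the fitted normal scores better than the shifted one.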
- fit(verbose=3)
Collect the required distribution functions.
- Parameters
verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.
- Returns
Object.
self.distributions (functions) – list of functions containing distributions.
- fit_transform(X, verbose=3)
Fit best scoring theoretical distribution to the empirical data (X).
- Parameters
X (array-like) – Set of values belonging to the data
verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.
- Returns
dict.
model (dict) – dict containing keys with distribution parameters:
score : scoring statistic
name : distribution name
distr : distribution function
params : all kind of parameters
loc : loc function parameter
scale : scale function parameter
arg : arg function parameter
summary (list) – Residual Sum of Squares
histdata (tuple (observed, bins)) – tuple containing observed and bins for data X in the histogram.
size (int) – total number of elements in data X.
- generate(n, random_state=None, verbose=3)
Generate new samples based on the fitted distribution.
- load(filepath, verbose=3)
Load learned model.
- Parameters
filepath (str) – Pathname to stored pickle files.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.
- Return type
Object.
- plot(title='', figsize=(10, 8), xlim=None, ylim=None, fig=None, ax=None, verbose=3)
Make plot.
- Parameters
title (String, optional (default: '')) – Title of the plot.
figsize (tuple, optional (default: (10,8))) – The figure size.
xlim (Float, optional (default: None)) – Limit figure in x-axis.
ylim (Float, optional (default: None)) – Limit figure in y-axis.
fig (Figure, optional (default: None)) – Matplotlib figure (Note - ignored when method is discrete)
ax (Axes, optional (default: None)) – Matplotlib Axes object (Note - ignored when method is discrete)
verbose (Int [1-5], optional (default: 3)) – Print information to screen.
- Return type
tuple (fig, ax)
- plot_summary(n_top=None, figsize=(15, 8), ylim=None, fig=None, ax=None, verbose=3)
Plot summary results.
- Parameters
n_top (int, optional) – Show the top number of results. The default is None.
figsize (tuple, optional (default: (15,8))) – The figure size.
ylim (Float, optional (default: None)) – Limit figure in y-axis.
fig (Figure, optional (default: None)) – Matplotlib figure
ax (Axes, optional (default: None)) – Matplotlib Axes object
verbose (Int [1-5], optional (default: 3)) – Print information to screen.
- Return type
tuple (fig, ax)
- predict(y, verbose=3)
Compute probability for response variables y, using the specified method.
Computes P-values for [y] based on the fitted distribution from X. The empirical distribution of X is used to estimate the loc/scale/arg parameters for a theoretical distribution in case method type is parametric.
- Parameters
y (array-like) – Values to be predicted.
model (dict, default : None) – The model created by the .fit() function.
verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.
- Returns
Object.
y_pred (list of str) – prediction of bounds [upper, lower] for input y, using the distribution fitted on X.
y_proba (list of float) – probability for response variable y.
df (pd.DataFrame (only when set: todf=True)) – Dataframe containing the predictions in a structured manner.
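To make the relation between y_proba and y_pred concrete, the sketch below fits a normal distribution to X with the standard library, computes a tail probability for each value in y, and labels values that fall below alpha. The labels 'up', 'down' and 'none', the helper names, and the choice to compare each tail with alpha directly (with no multiple test correction) are assumptions made for illustration; they are not taken from distfit's source.

```python
import random
from statistics import NormalDist, fmean, stdev

rng = random.Random(1)
X = [rng.gauss(0, 2) for _ in range(1000)]
fitted = NormalDist(fmean(X), stdev(X))   # loc/scale estimated from X


def tail_probability(value, dist):
    """Probability mass in the tail beyond `value`, on its own side of the median."""
    cdf = dist.cdf(value)
    return min(cdf, 1.0 - cdf)


def predict(y, dist, alpha=0.05):
    """Return (labels, probabilities) for each value in y."""
    y_proba = [tail_probability(v, dist) for v in y]
    y_pred = []
    for v, p in zip(y, y_proba):
        if p > alpha:
            y_pred.append("none")   # not significant in either tail
        else:
            y_pred.append("up" if v > dist.mean else "down")
    return y_pred, y_proba


y = [-8, -6, 0, 1, 2, 3, 4, 5, 6]
y_pred, y_proba = predict(y, fitted)
```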
- save(filepath, overwrite=True, verbose=3)
Save learned model in pickle file.
- Parameters
filepath (str) – Pathname to store pickle files.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.
- Return type
object
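Since save and load are described as writing and reading a pickle file, the round trip can be sketched with the standard pickle module. The model dict, the helper functions and the overwrite behaviour shown here (refuse to write when the file exists and overwrite is False) are assumptions for illustration, not distfit's own code.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model.
model = {"name": "norm", "loc": 0.0, "scale": 2.0}


def save(filepath, obj, overwrite=True):
    """Pickle obj to filepath; return False if the file exists and overwrite is off."""
    if os.path.exists(filepath) and not overwrite:
        return False
    with open(filepath, "wb") as fh:
        pickle.dump(obj, fh)
    return True


def load(filepath):
    """Read a pickled object back from filepath."""
    with open(filepath, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.pkl")
    first_write = save(path, model)
    blocked = save(path, {"name": "t"}, overwrite=False)  # file already exists
    restored = load(path)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.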
- transform(X, verbose=3)
Determine best model for input data X.
The input data X can be modelled in two ways:
- parametric
In the parametric case, the best fit on the data is determined using a scoring statistic such as the Residual Sum of Squares (RSS) for the specified distributions. Based on the best distribution fit, the confidence intervals (CII) can be determined for later use in the predict() function.
- quantile
In the quantile case, the data is ranked and the top/lower quantiles are determined.
- Parameters
X (array-like) – The null distribution or background data is built from X.
verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.
- Returns
Object.
model (dict) – dict containing keys with distribution parameters:
score : scoring statistic
name : distribution name
distr : distribution function
params : all kind of parameters
loc : loc function parameter
scale : scale function parameter
arg : arg function parameter
summary (list) – Residual Sum of Squares
histdata (tuple (observed, bins)) – tuple containing observed and bins for data X in the histogram.
size (int) – total number of elements in data X.
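The quantile case described above can be sketched with the standard library: rank the background data, take the lower and upper cut points at alpha and 1 - alpha, and compare new values against them. The cut-point convention, the variable names and the 'up'/'down'/'none' labels are assumptions for illustration.

```python
import random
from statistics import quantiles

rng = random.Random(2)
X = [rng.gauss(0, 2) for _ in range(1000)]   # background (null) data

alpha = 0.05
# quantiles(..., n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
cuts = quantiles(X, n=100)
lower = cuts[round(alpha * 100) - 1]          # 5th percentile
upper = cuts[round((1 - alpha) * 100) - 1]    # 95th percentile


def classify(value):
    """Label a value relative to the empirical lower and upper bounds."""
    if value < lower:
        return "down"
    if value > upper:
        return "up"
    return "none"


labels = [classify(v) for v in (-8, 0, 8)]
```

No theoretical distribution is fitted in this mode; the bounds come directly from the ranked data.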
- distfit.distfit.fit_binom(X)
Transform array of samples (nonnegative ints) to histogram.
- distfit.distfit.fit_transform_binom(X, f=1.5, weighted=True, stats='RSS', verbose=3)
Convert array of samples (nonnegative ints) to histogram and fit.
- distfit.distfit.plot_binom(self, title='', figsize=(10, 8), xlim=None, ylim=None, verbose=3)
Plot discrete results.
- Parameters
model (dict) – Results derived from the fit_transform function.
- distfit.distfit.smoothline(xs, ys=None, interpol=3, window=1, verbose=3)
Smoothing 1D vector.
Smoothing a 1D vector can be challenging if the data is sparsely sampled. This smoothing function therefore has two steps: interpolation of the input line, followed by a convolution.
- Parameters
xs (array-like) – Data points for the x-axis.
ys (array-like) – Data points for the y-axis.
interpol (int, (default : 3)) – The interpolation factor. The data is interpolated by this factor before the smoothing step.
window (int, (default : 1)) – Smoothing window that is used to create the convolution and gradually smooth the line.
verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.
- Returns
xnew (array-like) – Data points for the x-axis.
ynew (array-like) – Data points for the y-axis.
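The two steps described for smoothline, interpolation followed by a convolution, can be sketched in plain Python. Linear interpolation and a uniform moving-average kernel are used here as simple stand-ins; the interpolation scheme and kernel that smoothline actually uses are not stated in this reference, and the helper names are hypothetical.

```python
def interpolate(xs, ys, factor):
    """Linearly interpolate so each original segment yields `factor` points."""
    xnew, ynew = [], []
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        for k in range(factor):
            t = k / factor
            xnew.append(x0 + t * (x1 - x0))
            ynew.append(y0 + t * (y1 - y0))
    # Close the line with the final original point.
    xnew.append(xs[-1])
    ynew.append(ys[-1])
    return xnew, ynew


def moving_average(values, window):
    """Average each point with `window // 2` neighbours on each side."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


xs = [0, 1, 2, 3, 4]
ys = [0.0, 4.0, 0.0, 4.0, 0.0]               # a jagged, sparsely sampled line
xnew, ynew = interpolate(xs, ys, factor=3)
ysmooth = moving_average(ynew, window=3)
```

Interpolating first gives the kernel more points to average over, so the peaks are flattened gradually instead of being replaced by a few coarse steps.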
- distfit.distfit.transform_binom(hist, plot=True, weighted=True, f=1.5, stats='RSS', verbose=3)
Fit histogram to binomial distribution.
- Parameters
hist (array-like) – histogram as int array with counts, array index as bin.
weighted (Bool, (default: True)) – In principle, the best fit is obtained with weighted=True. However, when using a different measure as the metric, such as the minimum residual sum of squares (RSS), you can set weighted=False.
f (float, (default: 1.5)) – try to fit n in range n0/f to n0*f where n0 is the initial estimate.
- Returns
model (dict) –
- distr (Object) – fitted binomial model.
- name (String) – name of the fitted distribution.
- RSS (float) – best RSS score.
- n (int) – binomial n value.
- p (float) – binomial p value.
- chi2r (float) – reduced chi-squared (rchi2). This number should be around 1. Large values indicate a bad fit; small values indicate ‘too good to be true’ data.
figdata (dict) –
- sses (array-like) – the computed RSS scores accompanying the various n.
- Xdata (array-like) – input data.
- hist (array-like) – fitted histogram as int array, same length as hist.
- Ydata (array-like) – probability mass function.
- nvals (array-like) – evaluated n’s.
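The search over n described by the f parameter can be sketched as follows: estimate n0 from the histogram, evaluate every integer n in the range n0/f to n0*f, and keep the n with the lowest RSS. The method-of-moments start, the choice p = mean/n, the unweighted score and the helper names are all assumptions for illustration; the library's own optimisation may differ.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass; zero outside the support."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def fit_binom_rss(hist, f=1.5):
    """Scan n in [n0/f, n0*f] and return (best_n, best_p, scores)."""
    total = sum(hist)
    freqs = [c / total for c in hist]
    mean = sum(k * q for k, q in enumerate(freqs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(freqs))
    # Method of moments: mean = n*p and var = n*p*(1-p) give p = 1 - var/mean.
    # This start assumes var < mean, as expected for binomial data.
    p0 = 1 - var / mean
    n0 = max(1, round(mean / p0))
    scores = {}
    for n in range(max(1, math.ceil(n0 / f)), math.floor(n0 * f) + 1):
        p = mean / n                  # keep the fitted mean equal to the observed mean
        if p > 1:
            continue                  # n too small to reproduce the mean
        scores[n] = sum((q - binom_pmf(k, n, p)) ** 2 for k, q in enumerate(freqs))
    best_n = min(scores, key=scores.get)
    return best_n, mean / best_n, scores


# Counts per value 0..10, rounded from 10000 * Binomial(n=10, p=0.3).
hist = [282, 1211, 2335, 2668, 2001, 1029, 368, 90, 14, 1, 0]
best_n, best_p, scores = fit_binom_rss(hist, f=1.5)
```

The scores dict plays the role of the per-n RSS values, and its keys the role of the evaluated n's.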