API References

Compute the best fit to your empirical distribution across 89 theoretical distributions using Residual Sum of Squares (RSS) estimates.

class distfit.distfit.BinomPMF(n)

Wrapper so that integer parameters don’t occur as function arguments.

class distfit.distfit.distfit(method='parametric', alpha=0.05, multtest='fdr_bh', bins=50, bound='both', distr='popular', stats='RSS', smooth=None, n_perm=10000, todf=False, weighted=True, f=1.5)

Probability density function fitting across 89 univariate distributions to non-censored data by scoring statistics such as residual sum of squares (RSS), making plots, and hypothesis testing.

Parameters
  • method (str, default: 'parametric') – Fitting method: ‘parametric’, ‘quantile’, ‘percentile’ or ‘discrete’.

  • alpha (float, default: 0.05) – Significance level, used as the cut-off for P-values.

  • multtest (str, default: 'fdr_bh') – Multiple test correction method: None, ‘bonferroni’, ‘sidak’, ‘holm-sidak’, ‘holm’, ‘simes-hochberg’, ‘hommel’, ‘fdr_bh’, ‘fdr_by’, ‘fdr_tsbh’, ‘fdr_tsbky’.

  • bins (int, default: 50) – Number of bins used to build the empirical histogram.

  • bound (str, default: 'both') – Direction in which to test for significance. Upper bound: ‘up’, ‘high’ or ‘right’; lower bound: ‘down’, ‘low’ or ‘left’; ‘both’ tests both directions.

  • distr (str, default: 'popular') – The distribution or set of distributions to test. Use ‘popular’ or ‘full’ for a predefined set, name a single theoretical distribution such as ‘norm’ or ‘t’, or pass a list such as [‘norm’, ‘t’, ...]. If method=‘discrete’, the binomial distribution is used. See the docs for more information about ‘popular’ and ‘full’: https://erdogant.github.io/distfit/pages/html/Parametric.html

  • smooth (int, default: None) – Smoothing the histogram can help to get a better fit when only a few samples are available.

  • stats (str, default: 'RSS') – Scoring statistic: ‘RSS’, ‘wasserstein’, ‘ks’ or ‘energy’. ‘ks’ stands for the Kolmogorov-Smirnov statistic.

  • n_perm (int, default: 10000) – Number of permutations used to model the null distribution when method is ‘quantile’.

  • weighted (Bool, (default: True)) – Only used in discrete fitting. In principle, the best fit is obtained with weighted=True. To use a different measure, such as the minimum sum of squared errors (SSE), set weighted=False.

  • f (float, (default: 1.5)) – Only used in discrete fitting. Candidate values of n are searched in the range n0/f to n0*f, where n0 is the initial estimate.
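To give a feel for what the default multtest='fdr_bh' setting refers to, the following is a minimal standard-library sketch of the Benjamini-Hochberg adjustment. It is not taken from distfit; the function name benjamini_hochberg and the example P-values are assumptions made for illustration only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
padj = benjamini_hochberg(pvals)
significant = [p <= 0.05 for p in padj]
```

After adjustment, a value is called significant when its adjusted P-value is at or below alpha, which is why fewer values may pass than with the raw P-values.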

Returns

  • object.

  • method (str) – Specified method for fitting and predicting.

  • alpha (float) – Specified cut-off for P-value significance.

  • bins (int) – Number of bins specified to create histogram.

  • bound (str) – Specified testing directionality of the distribution.

  • distr (str or list of strings) – Specified distribution or a set of distributions.

  • multtest (str) – Specified multiple test correction method.

  • todf (Bool (default: False)) – Output results in pandas dataframe when True. Note that creating pandas dataframes makes the code run significantly slower!

Example

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Set parameters
>>> # Default method is set to parametric models
>>> dist = distfit()
>>> # In case of quantile
>>> dist = distfit(method='quantile')
>>> # In case of percentile
>>> dist = distfit(method='percentile')
>>> # Fit using method
>>> model_results = dist.fit_transform(X)
>>> dist.plot()
>>>
>>> # Make prediction
>>> results = dist.predict(y)
>>> dist.plot()

fit(verbose=3)

Collect the required distribution functions.

Parameters

verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.

Returns

  • Object.

  • self.distributions (functions) – list of functions containing distributions.

fit_transform(X, verbose=3)

Fit best scoring theoretical distribution to the empirical data (X).

Parameters
  • X (array-like) – Set of values belonging to the data

  • verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.

Returns

  • dict.

  • model (dict) – dict containing keys with distribution parameters:

    score : scoring statistic
    name : distribution name
    distr : distribution function
    params : all kinds of parameters
    loc : loc function parameter
    scale : scale function parameter
    arg : arg function parameter

  • summary (list) – Residual Sum of Squares

  • histdata (tuple (observed, bins)) – tuple containing observed and bins for data X in the histogram.

  • size (int) – total number of elements in data X.

generate(n, random_state=None, verbose=3)

Generate new samples based on the fitted distribution.

load(filepath, verbose=3)

Load learned model.

Parameters
  • filepath (str) – Pathname to stored pickle files.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Returns

Return type

Object.

plot(title='', figsize=(10, 8), xlim=None, ylim=None, verbose=3)

Make plot.

Parameters
  • title (String, optional (default: '')) – Title of the plot.

  • figsize (tuple, optional (default: (10,8))) – The figure size.

  • xlim (Float, optional (default: None)) – Limit figure in x-axis.

  • ylim (Float, optional (default: None)) – Limit figure in y-axis.

  • verbose (Int [1-5], optional (default: 3)) – Print information to screen.

Returns

Return type

tuple (fig, ax)

plot_summary(n_top=None, figsize=(15, 8), ylim=None, verbose=3)

Plot summary results.

Parameters
  • n_top (int, optional) – Show the top number of results. The default is None.

  • figsize (tuple, optional (default: (15,8))) – The figure size.

  • ylim (Float, optional (default: None)) – Limit figure in y-axis.

  • verbose (Int [1-5], optional (default: 3)) – Print information to screen.

Returns

Return type

tuple (fig, ax)

predict(y, verbose=3)

Compute probability for response variables y, using the specified method.

Computes P-values for [y] based on the fitted distribution from X. The empirical distribution of X is used to estimate the loc/scale/arg parameters for a theoretical distribution in case method type is parametric.
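As a rough illustration of the parametric case, the sketch below estimates loc and scale for a normal distribution from X and then computes a tail probability for each value in y. It uses only the Python standard library and is not distfit's implementation: the helper tail_probability, the choice of a normal distribution, and taking the smaller of the two tails are assumptions, and no multiple test correction is applied.

```python
import random
from statistics import NormalDist, fmean, pstdev

random.seed(3)
X = [random.gauss(0, 2) for _ in range(1000)]  # background data
y = [-8, -6, 0, 1, 2, 3, 4, 5, 6]              # values to score
alpha = 0.05

# Estimate loc (mean) and scale (standard deviation) from X.
fitted = NormalDist(fmean(X), pstdev(X))


def tail_probability(value):
    """Probability mass in the smaller tail beyond `value` (assumed convention)."""
    cdf = fitted.cdf(value)
    return min(cdf, 1.0 - cdf)


y_proba = [tail_probability(v) for v in y]
significant = [p <= alpha for p in y_proba]
```

Values far from the bulk of X, such as -8 and 6 here, receive small probabilities, while values near the center do not.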

Parameters
  • y (array-like) – Values to be predicted.

  • model (dict, default : None) – The model created by the .fit() function.

  • verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.

Returns

  • Object.

  • y_pred (list of str) – prediction of bounds [upper, lower] for input y, using the distribution fitted on X.

  • y_proba (list of float) – probability for response variable y.

  • df (pd.DataFrame (only when set: todf=True)) – Dataframe containing the predictions in a structured manner.

save(filepath, overwrite=True, verbose=3)

Save learned model in pickle file.

Parameters
  • filepath (str) – Pathname to store pickle files.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Returns

Return type

object

transform(X, verbose=3)

Determine best model for input data X.

The input data X can be modelled in two ways:

parametric

In the parametric case, the best fit on the data is determined using a scoring statistic such as the Residual Sum of Squares (RSS) for the specified distributions. Based on the best-fitting distribution, the confidence intervals (CII) can be determined for later use in the predict() function.
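The RSS idea can be sketched with the standard library alone: build a density histogram of X, evaluate each candidate PDF at the bin centers, and sum the squared residuals. This is an illustrative sketch, not distfit's code; the helpers density_histogram and rss and the two candidates (a normal and a uniform distribution) are assumptions chosen to keep the example small.

```python
import random
from statistics import NormalDist, fmean, pstdev


def density_histogram(data, bins=50):
    """Return (bin_centers, densities) for an equal-width histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Clamp the maximum value into the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    densities = [c / (len(data) * width) for c in counts]
    return centers, densities


def rss(pdf, centers, densities):
    """Residual sum of squares between a candidate PDF and the empirical density."""
    return sum((d - pdf(c)) ** 2 for c, d in zip(centers, densities))


random.seed(0)
X = [random.gauss(0, 2) for _ in range(1000)]
centers, densities = density_histogram(X, bins=50)

# Candidate 1: normal distribution with loc/scale estimated from the data.
normal = NormalDist(fmean(X), pstdev(X))

# Candidate 2: uniform distribution over the observed range.
lo, hi = min(X), max(X)


def uniform_pdf(x):
    return 1.0 / (hi - lo)


scores = {
    "norm": rss(normal.pdf, centers, densities),
    "uniform": rss(uniform_pdf, centers, densities),
}
best = min(scores, key=scores.get)  # lowest RSS wins
```

The candidate with the lowest RSS is taken as the best fit; for normally distributed input the normal candidate should score well below the uniform one.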

quantile

In the quantile case, the data is ranked and the top/lower quantiles are determined.
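A minimal sketch of the quantile idea, using only the standard library: rank the data, read off the lower and upper cut-offs at alpha and 1 - alpha, and label new values against them. The function names quantile_bounds and label, the index rule used to pick the cut-offs, and the 'none' label are assumptions for illustration and may differ from what distfit does internally.

```python
import random


def quantile_bounds(data, alpha=0.05):
    """Return (lower, upper) empirical cut-offs at alpha and 1 - alpha."""
    ranked = sorted(data)
    n = len(ranked)
    lower = ranked[int(alpha * (n - 1))]
    upper = ranked[int((1 - alpha) * (n - 1))]
    return lower, upper


def label(y, lower, upper):
    """Label each value as 'down', 'up' or 'none' relative to the cut-offs."""
    return ["down" if v < lower else "up" if v > upper else "none" for v in y]


random.seed(1)
X = [random.gauss(0, 2) for _ in range(1000)]
lower, upper = quantile_bounds(X, alpha=0.05)
labels = label([-8, -6, 0, 1, 2, 3, 4, 5, 6], lower, upper)
```

No theoretical distribution is fitted here; the cut-offs come directly from the ranked data.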

Parameters
  • X (array-like) – The null distribution or background data is built from X.

  • verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.

Returns

  • Object.

  • model (dict) – dict containing keys with distribution parameters:

    score : scoring statistic
    name : distribution name
    distr : distribution function
    params : all kinds of parameters
    loc : loc function parameter
    scale : scale function parameter
    arg : arg function parameter

  • summary (list) – Residual Sum of Squares

  • histdata (tuple (observed, bins)) – tuple containing observed and bins for data X in the histogram.

  • size (int) – total number of elements in data X.

distfit.distfit.fit_binom(X)

Transform array of samples (nonnegative ints) to histogram.

distfit.distfit.fit_transform_binom(X, f=1.5, weighted=True, stats='RSS', verbose=3)

Convert array of samples (nonnegative ints) to histogram and fit.

distfit.distfit.plot_binom(self, title='', figsize=(10, 8), xlim=None, ylim=None, verbose=3)

Plot discrete results.

Parameters

model (dict) – Results derived from the fit_transform function.

distfit.distfit.smoothline(xs, ys=None, interpol=3, window=1, verbose=3)

Smoothing 1D vector.

Smoothing a 1D vector can be challenging when only a few data points are available. This smoothing function therefore works in two steps: first the input line is interpolated, then a convolution is applied.
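The two steps can be sketched as follows with the standard library. This is not the smoothline implementation: linear interpolation and a box-shaped moving average are stand-ins chosen for simplicity, and the helper names interpolate and moving_average are hypothetical.

```python
def interpolate(xs, ys, factor=3):
    """Linearly interpolate so each original segment gets `factor` sub-steps."""
    xnew, ynew = [], []
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        y0, y1 = ys[i], ys[i + 1]
        for k in range(factor):
            t = k / factor
            xnew.append(x0 + t * (x1 - x0))
            ynew.append(y0 + t * (y1 - y0))
    # Close the line with the final original point.
    xnew.append(xs[-1])
    ynew.append(ys[-1])
    return xnew, ynew


def moving_average(values, window=3):
    """Convolve with a box kernel, shrinking the window at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


xs = [0, 1, 2, 3, 4]
ys = [0.0, 4.0, 1.0, 5.0, 2.0]

# Step 1: densify the line. Step 2: smooth the densified values.
xnew, ynew = interpolate(xs, ys, factor=3)
ysmooth = moving_average(ynew, window=3)
```

Interpolating first gives the convolution more points to average over, so the peaks are softened without collapsing a short input to a nearly flat line.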

Parameters
  • xs (array-like) – Data points for the x-axis.

  • ys (array-like) – Data points for the y-axis.

  • interpol (int, (default : 3)) – The interpolation factor. The data is interpolated by a factor n before the smoothing step.

  • window (int, (default : 1)) – Smoothing window that is used to create the convolution and gradually smoothen the line.

  • verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.

Returns

  • xnew (array-like) – Data points for the x-axis.

  • ynew (array-like) – Data points for the y-axis.

distfit.distfit.transform_binom(hist, plot=True, weighted=True, f=1.5, stats='RSS', verbose=3)

Fit histogram to binomial distribution.

Parameters
  • hist (array-like) – histogram as int array with counts, array index as bin.

  • weighted (Bool, (default: True)) – In principle, the best fit is obtained with weighted=True. To use a different measure, such as the minimum residual sum of squares (RSS), set weighted=False.

  • f (float, (default: 1.5)) – try to fit n in range n0/f to n0*f where n0 is the initial estimate.
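To make the n0/f to n0*f search concrete, here is a hedged standard-library sketch. It assumes a method-of-moments initial estimate for n0, estimates p from the sample mean for each candidate n, and scores each candidate by an unweighted RSS against the empirical probability mass. None of this is taken from transform_binom itself; the helper binom_pmf and all variable names are illustrative.

```python
import math
import random
from statistics import fmean, pvariance


def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


random.seed(2)
n_true, p_true = 8, 0.5
X = [sum(random.random() < p_true for _ in range(n_true)) for _ in range(5000)]

# Empirical probability mass per integer value (array index = bin).
hist = [X.count(k) / len(X) for k in range(max(X) + 1)]

# Assumed initial estimate via method of moments: mean = n*p, var = n*p*(1-p).
mean, var = fmean(X), pvariance(X)
n0 = round(mean**2 / (mean - var)) if mean > var else max(X)
n0 = max(n0, max(X))  # n can never be smaller than the largest observation

# Candidate n values from n0/f to n0*f.
f = 1.5
candidates = range(max(max(X), math.floor(n0 / f)), math.ceil(n0 * f) + 1)

rss_per_n = {}
for n in candidates:
    p = mean / n  # for a fixed n, estimate p from the sample mean
    rss_per_n[n] = sum((h - binom_pmf(k, n, p)) ** 2 for k, h in enumerate(hist))

n_best = min(rss_per_n, key=rss_per_n.get)
p_best = mean / n_best
```

A larger f widens the set of candidate n values at the cost of more evaluations.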

Returns

  • model (dict) –

    distr (Object) – fitted binomial model.

    name (String) – name of the fitted distribution.

    RSS (float) – best RSS score.

    n (int) – binomial n value.

    p (float) – binomial p value.

    chi2r (float) – reduced chi-squared (rchi2). This number should be around 1. Large values indicate a bad fit; small values indicate ‘too good to be true’ data.

  • figdata (dict) –

    sses (array-like) – the computed RSS scores accompanying the various n.

    Xdata (array-like) – input data.

    hist (array-like) – fitted histogram as int array, same length as hist.

    Ydata (array-like) – probability mass function.

    nvals (array-like) – evaluated n’s.